Abstract

We analyze the problem of sequential probability assignment for binary outcomes with side information and
logarithmic loss, where regret (or redundancy) is measured with respect to a (possibly infinite) class of experts.
We provide upper and lower bounds for minimax regret in terms of sequential complexities of the class, introduced
in [14, 13]. These complexities were recently shown to give matching (up to logarithmic factors) upper
and lower bounds for sequential prediction with general convex Lipschitz loss functions [1, 12]. To deal with
unbounded gradients of the logarithmic loss, we present a new analysis that employs a sequential chaining technique
with a Bernstein-type bound. The introduced complexities are intrinsic to the problem of sequential probability
assignment, as illustrated by our lower bound.

We also consider an example of a large class of experts parametrized by vectors in a high-dimensional Euclidean
ball (or a Hilbert ball). The typical discretization approach fails, while our techniques give a non-trivial
bound. For this problem we also present an algorithm based on regularization with a self-concordant barrier.
This algorithm is of independent interest, as it requires a bound on the function values rather than gradients.


1 Introduction 

In this paper we study the problem of sequential prediction of a string of bits $(y_1,\ldots,y_n) = y_{1:n} \in \{0,1\}^n$. At each
round $t \ge 1$ the forecaster observes side information $x_t \in \mathcal{X}_t$, decides on the probability $\hat{y}_t \in [0,1]$ of the
event $y_t = 1$, observes the outcome $y_t \in \{0,1\}$, and pays according to the logarithmic (or self-information) loss
function

$$ \ell(\hat{y}_t, y_t) = -\mathbf{1}\{y_t = 1\} \log \hat{y}_t - \mathbf{1}\{y_t = 0\} \log(1 - \hat{y}_t). $$

At each time instance $t$, the side-information set $\mathcal{X}_t$ is a subset of an abstract set $\mathcal{X}$. The subset $\mathcal{X}_t$ is allowed to
depend on the history $h_{1:t-1} = (x_{1:t-1}, y_{1:t-1})$, and the functions $\mathcal{X}_t : (\mathcal{X} \times \{0,1\})^{t-1} \to 2^{\mathcal{X}}$ are assumed to be known
to the forecaster.

The goal of the forecaster is to predict as well as a benchmark set $\mathcal{F}$ of functions (sometimes called “experts”)
mapping $\mathcal{X}$ to $[0,1]$. More specifically, the goal is to keep regret

$$ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f(x_t), y_t) $$

as small as possible for all sequences $y_1,\ldots,y_n$ and $x_1,\ldots,x_n$ (satisfying $x_t \in \mathcal{X}_t(h_{1:t-1})$).
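The regret just defined can be computed directly for a finite set of experts. The following is a minimal sketch, assuming the experts' predictions $f(x_t)$ are already given as lists of numbers; the names `log_loss`, `regret`, and the toy data are illustrative and not part of the paper.

```python
import math

def log_loss(p_hat, y):
    """Logarithmic (self-information) loss of forecast p_hat for outcome y in {0, 1}."""
    return -math.log(p_hat) if y == 1 else -math.log(1.0 - p_hat)

def regret(forecasts, expert_forecasts, outcomes):
    """Cumulative loss of the forecaster minus that of the best expert in hindsight.

    expert_forecasts[k][t] is the (hypothetical) prediction of expert k at round t.
    """
    ours = sum(log_loss(p, y) for p, y in zip(forecasts, outcomes))
    best = min(
        sum(log_loss(q, y) for q, y in zip(row, outcomes))
        for row in expert_forecasts
    )
    return ours - best

# Toy data (assumed for illustration): four rounds, two constant experts.
outcomes = [1, 1, 0, 1]
forecasts = [0.5, 0.5, 0.5, 0.5]      # uninformed forecaster
experts = [[0.75] * 4, [0.25] * 4]
print(regret(forecasts, experts, outcomes))
```

An expert that plays its own predictions has zero regret against a class containing it, which is a quick sanity check on the sign convention.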

To illustrate the setting, consider a few examples. We may take $\mathcal{X}_t(h_{1:t-1}) = \{(y_1,\ldots,y_{t-1})\} \subset \{0,1\}^{t-1}$ to be
a singleton set containing the exact realization of the sequence so far. In this case, the choice $x_t = (y_1,\ldots,y_{t-1})$
is enforced and $f(x_t) = p_f(1 \mid y_1,\ldots,y_{t-1})$ may be viewed as a conditional distribution; the normalized maximum
likelihood forecaster is known to be minimax optimal in this extensively studied scenario (e.g. [4, Ch. 9]). Alternatively,
we may define $\mathcal{X}_t(h_{1:t-1}) = \{y' \in \{0,1\}^{t-1} : d_H(y_{1:t-1}, y') \le r\}$ to be a set that contains histories with up to $r$
flips of the bits. In this case, the forecaster is facing a situation where history can be slightly altered in an adversarial
fashion. As another example, we may take $\mathcal{X}_t(h_{1:t-1}) = \{(y_{t-k},\ldots,y_{t-1})\}$, in which case the forecaster competes
with a set of $k$th-order stationary Markov experts. The set $\mathcal{X}_t$ may also be time-invariant, in which case $f$ is a
memoryless expert that acts on side information. In short, the formulation we presented subsumes a wide range
of interesting problems. Our goal in this paper is to understand how “complexity” of $\mathcal{F}$ affects minimax rates of
regret.

The minimax regret for the problem of sequential probability assignment can be written as

$$ V_n(\mathcal{F}) = \Big\langle\!\!\Big\langle \sup_{x_t \in \mathcal{X}_t(x_{1:t-1}, y_{1:t-1})} \; \inf_{\hat{y}_t \in [0,1]} \; \sup_{p_t \in [0,1]} \; \mathbb{E}_{y_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n} \left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f(x_t), y_t) \right] \tag{1} $$

where $\mathbb{E}_{y_t \sim p_t}$ is a shorthand for the expectation with respect to Bernoulli $y_t$ with bias $p_t$. Following [10], the notation
$\langle\!\langle \cdots \rangle\!\rangle_{t=1}^{n}$ represents a repeated application of the operators inside the brackets and corresponds to the unrolled
minimax value of the associated game between the forecaster and Nature. Any upper bound on $V_n(\mathcal{F})$ guarantees
existence of a strategy that attains regret of at most that amount. In the last few years, new techniques with roots in
empirical process theory have emerged for analyzing minimax values of the form (1). We bring these techniques
to bear on the problem of sequential probability assignment with self-information loss.

Our point of comparison will be the study of rich classes in [4, Section 9.10]. Following [4], we employ the
truncation method to deal with the unbounded loss function. To this end, fix $\delta \in (0,1/2)$, to be chosen later. For
$a \in [0,1]$, let $\tau_\delta(a)$ denote the thresholded value

$$ \tau_\delta(a) = \begin{cases} \delta & \text{if } a < \delta, \\ a & \text{if } a \in [\delta, 1-\delta], \\ 1-\delta & \text{if } a > 1-\delta. \end{cases} $$

For a class $\mathcal{F}$, let $\mathcal{F}^\delta = \{\tau_\delta(f) : f \in \mathcal{F}\}$ denote the class of truncated functions. It is easy to check (see [4, Lemma
9.5]) that

$$ V_n(\mathcal{F}) \le V_n(\mathcal{F}^\delta) + 2n\delta, \tag{2} $$
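The cost of truncation can be checked numerically. The sketch below, with the illustrative names `truncate` and `log_loss` and an assumed value of `delta`, scans a grid of forecasts and confirms that thresholding increases the per-round logarithmic loss by at most $-\log(1-\delta)$, which is below $2\delta$ for $\delta < 1/2$.

```python
import math

def truncate(a, delta):
    """Threshold a in [0, 1] to the interval [delta, 1 - delta]."""
    return min(max(a, delta), 1.0 - delta)

def log_loss(p_hat, y):
    """Logarithmic loss of forecast p_hat for outcome y in {0, 1}."""
    return -math.log(p_hat) if y == 1 else -math.log(1.0 - p_hat)

delta = 0.05  # assumed value, for illustration only
worst_increase = 0.0
for i in range(1001):
    a = i / 1000
    for y in (0, 1):
        # Skip forecasts with infinite loss; truncation can only help there.
        if (y == 1 and a == 0.0) or (y == 0 and a == 1.0):
            continue
        increase = log_loss(truncate(a, delta), y) - log_loss(a, y)
        worst_increase = max(worst_increase, increase)

print(worst_increase, -math.log(1.0 - delta), 2 * delta)
```

The worst case is a forecast that was certain and correct, since thresholding moves it away from the realized outcome.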

and we can, therefore, focus on the minimax regret with respect to $\mathcal{F}^\delta$. We show that $V_n(\mathcal{F}^\delta)$ can be upper
bounded via a modified (offset) sequential Rademacher complexity, which in turn can be controlled via sequential
chaining in the spirit of [11, 12]. Unlike the latter two papers, however, we do not employ symmetrization
and instead use the self-information property of the loss function. We are able to mitigate the adverse dependence
of $V_n(\mathcal{F}^\delta)$ on $\delta$ by introducing chaining with Bernstein-style terms that control the sub-Gaussian and
sub-exponential tail behaviors. As an example, we recover the $n^{3/5}$ rate for monotonically increasing experts presented
in [4, Sec 9.10-9.11]. However, our technique goes well beyond such examples of “static” experts. In particular,
we can obtain non-trivial rates even in the setting where discretization in the style of [4, Sec 9.10-9.11], [3] leads
to vacuous bounds. One such example is when experts are indexed by a unit ball in a Hilbert space (or a
high-dimensional Euclidean space) and the expert’s prediction depends linearly on side information. A discretization in
the supremum norm of this set of experts is not finite, and thus the typical approaches to this problem fail. In
contrast, we employ the ideas from empirical process theory and its sequential generalization in [14] in order to
define “data-dependent” notions of complexity.

Despite the improvement over the technique of [4], the rates attained in this paper are not always minimax optimal,
as we demonstrate in Section 6. This is in contrast to other loss functions (such as absolute, square, $q$-power,
and logistic) for which matching upper and lower bounds (to within logarithmic factors) have been established
recently in [12]. As mentioned in [4], the truncation method is crude, and we leave it as an open question whether a
different technique can be employed to attain optimal rates.

We finish this introduction with a brief mention that sequential probability assignment is extensively studied 
in Information Theory, where regret is known as redundancy with respect to a set of codes. The vast literature 
mostly investigates the case of parametric classes (see [17, 18, 5, 15, 16] and the references in [4, Ch. 9]), with exact
constants available in certain cases. We refer to [7] for a discussion of approaches to dealing with large comparator 
classes. Given the well-known connection to compression, it would be interesting to employ the relaxation-based 
algorithmic recipe of [9, 12, 10] to come up with novel data compression methods. 
2 Complexity of Large Classes of Experts

We focus on the minimax value for the thresholded class $\mathcal{F}^\delta$. To state the first technical lemma, we need the
definition of a tree. For an abstract set $\mathcal{Z}$, a $\mathcal{Z}$-valued complete binary tree $z$ of depth $n$ is a collection of labeling
functions $z_t : \{0,1\}^{t-1} \to \mathcal{Z}$ for $t \in \{1,\ldots,n\}$. For a sequence $y = (y_1,\ldots,y_n) \in \{0,1\}^n$ (which we call a path), we write
$z_t(y)$ for $z_t(y_1,\ldots,y_{t-1})$. Once we take $y_1,\ldots,y_n$ to be random variables, we may view $\{z_t\}$ as a predictable process
with respect to the filtration given by $\sigma(y_1,\ldots,y_{t-1})$.$^{1}$

We will say that an $\mathcal{X}$-valued tree $x$ is consistent with respect to the side information set mappings $h_{1:t-1} \mapsto \mathcal{X}_t(h_{1:t-1})$ if
for any $y \in \{0,1\}^n$, it holds that for all $t$,

$$ x_t(y) \in \mathcal{X}_t(x_1(y),\ldots,x_{t-1}(y), y_1,\ldots,y_{t-1}). $$

A consistent tree respects the sets of constraints $\mathcal{X}_t$ imposed by the problem. For the purposes of analyzing
complexity of $\mathcal{F}$, it is important that the constraints are reflected in the tree $x$.
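To make the definitions of a tree and of consistency concrete, here is a small sketch. It represents a tree as a Python function of the round $t$ and the prefix $(y_1,\ldots,y_{t-1})$, and checks consistency by enumerating all paths. The representation and the names `history_tree`, `exact_history_constraint`, and `is_consistent` are assumptions made for illustration.

```python
from itertools import product

def history_tree(t, prefix):
    """Tree whose label at round t is the observed prefix (y_1, ..., y_{t-1})."""
    return tuple(prefix)

def exact_history_constraint(xs, ys):
    """Side-information set X_t(h_{1:t-1}) = {(y_1, ..., y_{t-1})}, a singleton."""
    return {tuple(ys)}

def is_consistent(tree, constraint, n):
    """Check x_t(y) in X_t(x_1(y), ..., x_{t-1}(y), y_1, ..., y_{t-1}) on every path y."""
    for y in product((0, 1), repeat=n):
        xs = []
        for t in range(1, n + 1):
            label = tree(t, y[: t - 1])
            if label not in constraint(xs, y[: t - 1]):
                return False
            xs.append(label)
    return True

print(is_consistent(history_tree, exact_history_constraint, 4))         # True
print(is_consistent(lambda t, prefix: (), exact_history_constraint, 4))  # False
```

The enumeration over all $2^n$ paths is only feasible for small depth; it is meant to illustrate the definition rather than to be used at scale.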

Theorem 1 below relates the minimax regret with respect to $\mathcal{F}^\delta$ to the supremum of a stochastic process of
a form similar to offset Rademacher complexity introduced in [11]. The key difference with respect to [11] is that
the stochastic process is defined with potentially biased coin flips. To prove Theorem 1, we avoid symmetrization
and instead exploit the fact that the logarithmic loss has the self-information property: in the maximin dual,
the optimal probability assignment is given precisely by the distribution of the $y_t$ variable. We note that the
symmetrization approach of [11] appears to give worse rates for the logarithmic loss function.

Let

$$ \eta(p, a) = -\mathbf{1}\{a = 1\}\, p^{-1} + \mathbf{1}\{a = 0\}\, (1-p)^{-1} \tag{3} $$

and observe that $\eta$ is zero-mean if $a$ is a Bernoulli random variable with bias $p$.
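The zero-mean property can be verified by a direct two-point expectation. The sketch below also computes the variance, which works out to $1/(p(1-p))$; this variance formula is a side calculation added here, not a statement from the paper, and it shows why values of $p$ close to 0 or 1 are troublesome.

```python
def eta(p, a):
    """eta(p, a) from (3): -1/p if a == 1, and 1/(1 - p) if a == 0."""
    return -1.0 / p if a == 1 else 1.0 / (1.0 - p)

def mean_and_variance(p):
    """Mean and variance of eta(p, a) when a ~ Bernoulli(p)."""
    mean = p * eta(p, 1) + (1.0 - p) * eta(p, 0)
    second_moment = p * eta(p, 1) ** 2 + (1.0 - p) * eta(p, 0) ** 2
    return mean, second_moment - mean ** 2

for p in (0.05, 0.3, 0.5, 0.9):
    print(p, mean_and_variance(p))
```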

Theorem 1. The following upper bound holds:

$$ V_n(\mathcal{F}^\delta) \le \sup_{x,\mu,p} \mathbb{E} \sup_{f \in \mathcal{F}^\delta} \left[ \sum_{t:\, p_t(y) \in [\delta, 1-\delta]} \eta(p_t(y), y_t)\big(\mu_t(y) - f(x_t(y))\big) - \frac{1}{2}\big(\mu_t(y) - f(x_t(y))\big)^2 \right] + 2n\delta\log(1/\delta), $$

where $\mu, p$ range over all $[0,1]$-valued trees, $x$ ranges over consistent trees, and the stochastic process $y_1,\ldots,y_n$ is
defined via $y_t \mid y_1,\ldots,y_{t-1} \sim \mathrm{Bernoulli}(p_t(y_1,\ldots,y_{t-1}))$.

To shorten the notation in Theorem 1, let $\mathcal{Z} = \mathcal{X} \times [0,1]$ and for every $f \in \mathcal{F}^\delta$, write $g_f(z) = g_f(x, a) = a - f(x)$.
The upper bound of Theorem 1 can be written more succinctly as

$$ V_n(\mathcal{F}^\delta) \le \sup_{z,p} \mathbb{E} \sup_{f \in \mathcal{F}^\delta} \left[ \sum_{t:\, p_t(y) \in [\delta, 1-\delta]} \eta(p_t(y), y_t)\, g_f(z_t(y)) - \frac{1}{2}\, g_f(z_t(y))^2 \right] + 2n\delta\log(1/\delta). \tag{4} $$

We keep in mind that the $x$ part of $z$ is a consistent tree. Observe that the expression above is a supremum of a
collection of random variables indexed by $f \in \mathcal{F}^\delta$, each with a nonpositive mean. To analyze the supremum of
this stochastic process, we first consider the case when the indexing set is finite.

Lemma 2. For any set $V$ consisting of $[-1,1]$-valued trees, any $[\delta, 1-\delta]$-valued tree $p$, and any $c > 0$,

$$ \mathbb{E} \max_{v \in V} \left[ \sum_{t=1}^{n} \eta(p_t(y), y_t)\, v_t(y) - c\, v_t(y)^2 \right] \le \frac{\log |V|}{\delta \log(1 + c/2)}, $$

where $y_t \mid y_1,\ldots,y_{t-1} \sim \mathrm{Bernoulli}(p_t(y_1,\ldots,y_{t-1}))$. Furthermore, the same upper bound holds if $p$ is any $[0,1]$-valued
tree but the summation is restricted to $\{t : p_t(y) \in [\delta, 1-\delta]\}$.

1 We remark that in [14, 13, 10], the trees are defined with respect to $\{\pm 1\}$-valued sequences, whereas here we use $\{0,1\}$-valued variables.
The change is purely notational and all the definitions and results can be rephrased appropriately.
The tight control of the expectation is possible because of the negative quadratic term that acts as a compensator.
On the downside, the upper bound displays the adverse $1/\delta$ dependence. We now show a maximal
inequality when the quadratic term is not present. The bound is of a Bernstein type, with the sub-Gaussian and
sub-exponential behaviors. Crucially, the sub-Gaussian term scales with $1/\sqrt{\delta}$.

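Lemma 3. For any set $V$ consisting of $[-1,1]$-valued trees, any $[\delta, 1-\delta]$-valued tree $p$, and any $c > 0$,

$$ \mathbb{E} \max_{v \in V} \left[ \sum_{t=1}^{n} \eta(p_t(y), y_t)\, v_t(y) \right] \le 5\bar{v}\sqrt{\frac{n \log |V|}{\delta}} + \frac{2 v_{\max} \log |V|}{\delta}, $$

where $y_t \mid y_1,\ldots,y_{t-1} \sim \mathrm{Bernoulli}(p_t(y_1,\ldots,y_{t-1}))$, $\bar{v} = \max_{v \in V} \max_{y} \big(\tfrac{1}{n}\sum_{t=1}^{n} v_t(y)^2\big)^{1/2}$, and $v_{\max} = \max_{v \in V} \max_{y,t} |v_t(y)|$.
The same upper bound holds if $p$ is any $[0,1]$-valued tree but the summation is restricted to $\{t : p_t(y) \in [\delta, 1-\delta]\}$.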

We now pass from a finite collection to an infinite one via the sequential chaining technique [14]. For this
purpose, we recall the definition of $\ell_p$ sequential covering numbers.

Definition 1 ([14]). A set $V$ of $\mathbb{R}$-valued trees of depth $n$ is a (sequential) $\gamma$-cover (with respect to $\ell_p$, $p \ge 1$) of
$\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$ on a $\mathcal{Z}$-valued tree $z$ of depth $n$ if

$$ \forall g \in \mathcal{G},\; y \in \{0,1\}^n,\; \exists v \in V \;\text{ s.t. }\; \left( \frac{1}{n} \sum_{t=1}^{n} |v_t(y) - g(z_t(y))|^p \right)^{1/p} \le \gamma. \tag{5} $$

The size of the smallest $\gamma$-cover is denoted by $\mathcal{N}_p(\mathcal{G}, \gamma, z)$. For $p = \infty$, (5) becomes $\max_t |v_t(y) - g(z_t(y))| \le \gamma$.

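Theorem 4. Let $\mathcal{G}$ be a class of functions $\mathcal{Z} \to [-1,1]$. For any $[0,1]$-valued tree $p$, any $\mathcal{Z}$-valued tree $z$, any $K > 0$,
and $\gamma > 0$,

$$ \mathbb{E} \sup_{g \in \mathcal{G}} \left[ \sum_{t:\, p_t(y) \in [\delta, 1-\delta]} \eta(p_t(y), y_t)\, g(z_t(y)) - K g(z_t(y))^2 \right] \le \frac{1}{\delta} \frac{\log \mathcal{N}_\infty(\mathcal{G}, \gamma, z)}{\log(1 + K/4)} + \inf_{\alpha \in (0, \gamma]} \left\{ 4n\alpha + 30\sqrt{\frac{n}{\delta}} \int_{\alpha}^{\gamma} \sqrt{\log \mathcal{N}_\infty(\mathcal{G}, \rho, z)}\, d\rho + \frac{12}{\delta} \int_{\alpha}^{\gamma} \log \mathcal{N}_\infty(\mathcal{G}, \rho, z)\, d\rho \right\}, $$

where the stochastic process $y_1,\ldots,y_n$ is defined via $y_t \mid y_1,\ldots,y_{t-1} \sim \mathrm{Bernoulli}(p_t(y_1,\ldots,y_{t-1}))$.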
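Theorem 4 is readily applied to the upper bound of Theorem 1 by identifying

$$ \mathcal{G} = \{ g_f(z) = g_f(x, \mu) = \mu - f(x) : f \in \mathcal{F}^\delta,\; \mu \in \mathbb{R},\; x \in \mathcal{X} \} $$

and $z_t(y) = (x_t(y), \mu_t(y))$. It is immediate from the definition of a cover that for any $\mu$, $x$, and $z = (x, \mu)$,

$$ \mathcal{N}_p(\mathcal{F}^\delta, \alpha, x) = \mathcal{N}_p(\mathcal{G}, \alpha, z). \tag{6} $$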


The lower bound of Lemma 10 (presented in Section 7) and the relation between the offset Rademacher complexity
and sequential fat-shattering dimension [14, 12] yield the next theorem.

Theorem 5. For the case of constant sets $\mathcal{X}_1 = \mathcal{X}_2 = \cdots = \mathcal{X}$, the following are equivalent:

• Minimax regret is sublinear: $\frac{1}{n} V_n(\mathcal{F}) \to 0$ as $n \to \infty$

• Sequential dimension $\mathrm{fat}_\beta(\mathcal{F}, \mathcal{X})$ is finite for all $\beta > 0$

Let us make a few remarks. First, the theorem can be easily extended to non-constant sets $\mathcal{X}_t$, in which case
$\mathrm{fat}_\beta$ is defined with respect to consistent trees (as in the next section). Second, one may also phrase the equivalence
through sequential covering numbers, thanks to the relations outlined in [14, 12].

In summary, the sequential complexities we study are intrinsic to the problem of sequential probability assignment
(unlike, for instance, covering numbers with respect to the supremum norm on $\mathcal{X}$; see Section 4 for an
example). Yet, the upper bounds we derive do not quite match the lower bounds, due to the hard thresholding
approach and the need to balance $n\delta$ with $V_n(\mathcal{F}^\delta)$. It is an open problem to close the gap
between the upper and lower bounds.

The upper bound of Theorem 4 is quantified as soon as we have control of sequential covering numbers. While
covering numbers could be computed directly in many situations, it is often simpler to upper bound a
“scale-sensitive dimension” of the class, defined in the next section. In Section 5 we present an example of such a simple
calculation.


3 Covering Numbers and Combinatorial Parameters 


Suppose we can define a preorder $\preceq$ on the set $\mathcal{X}$ (that is, a binary relation that is reflexive and transitive). We
say that an $\mathcal{X}$-valued tree $x$ of depth $n$ is ordered if for any path $y \in \{0,1\}^n$, it holds that $x_t(y) \preceq x_{t+1}(y)$ for all
$t = 1,\ldots,n-1$. In this section we show that the combinatorial dimensions, covering numbers, and the associated
upper bounds in [14] can be extended to “respect” the preorder (of course, one can always define a vacuous relation
$\preceq$ and recover prior results).

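Definition 2. A class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ shatters (at scale $\beta > 0$) an ordered $\mathcal{X}'$-valued tree $x$ of depth $d$ if there exists an $\mathbb{R}$-valued
witness tree $s$ of depth $d$ such that

$$ \forall y \in \{0,1\}^d,\; \exists f \in \mathcal{F} \;\text{ s.t. }\; (2y_t - 1)\big(f(x_t(y)) - s_t(y)\big) \ge \beta/2. $$

The largest depth of a shattered ordered $\mathcal{X}'$-valued tree is denoted by $\mathrm{fat}^o_\beta(\mathcal{F}, \mathcal{X}')$, where the superscript $o$ stands for
“ordered”.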


The notion of Littlestone’s dimension $\mathrm{Ldim}(\mathcal{F}, \mathcal{X}')$ for $\{0,\ldots,k\}$-valued function classes extends in exactly
the same way to the case of ordered trees.

The main step in obtaining upper bounds on sequential covering numbers is the analogue of the
Vapnik-Chervonenkis-Sauer-Shelah lemma, proved in [14, 13]. We now show that if we ask for a $\beta$-cover on an ordered
tree $x$, the sequential covering numbers are controlled via the ordered version $\mathrm{fat}^o_\beta(\mathcal{F}, \mathrm{Img}(x))$ of the fat-shattering
dimension in Definition 2.


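Theorem 6 (Extension of Theorem 4 in [14]). Let $\mathcal{F} \subseteq \{0,\ldots,k\}^{\mathcal{X}}$ be a class of functions with $\mathrm{fat}^o_2(\mathcal{F}, \mathcal{X}) = d$. Then
for any $n > d$ and any ordered $\mathcal{X}$-valued tree $x$,

$$ \mathcal{N}_\infty(\mathcal{F}, 1/2, x) \le \sum_{i=0}^{d} \binom{n}{i} k^i. $$

Hence, for a class $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$, for any $\beta > 0$,

$$ \mathcal{N}_\infty(\mathcal{F}, \beta, x) \le \left( \frac{2en}{\beta} \right)^{\mathrm{fat}^o_\beta(\mathcal{F}, \mathcal{X})}. $$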


The following three sections are devoted to particular examples. We start by exhibiting a simple class for which
sequential covering numbers are small, yet the discretization with respect to the supremum norm (typically
performed to appeal to a finite-experts method) gives vacuous bounds.


4 Example: Consistent History 

We would like to illustrate that sequential covering numbers can be much smaller than covering numbers with
respect to the supremum norm over $\mathcal{X}$. Consider the particular case of $\mathcal{X}_t(h_{1:t-1}) = \{(y_1,\ldots,y_{t-1})\}$. Clearly, there
is only one consistent tree, namely the one defined by $x_t(y) = (y_1,\ldots,y_{t-1})$ for any $t$. In this case, the requirement
(5) in Definition 1 with class $\mathcal{F}^\delta$, consistent tree $x$, and $p = \infty$ reads as

$$ \forall f \in \mathcal{F}^\delta,\; y \in \{0,1\}^n,\; \exists v \in V \;\text{ s.t. }\; \max_t |v_t(y_{1:t-1}) - f(y_{1:t-1})| \le \gamma. \tag{7} $$
We contrast this with the definition in [4, Sec. 9.10], where the covering of $\mathcal{F}$ is done with respect to the following
pointwise metric (which we normalized by $\sqrt{n}$ for uniformity):

$$ d(f, g) = \sqrt{ \frac{1}{n} \sum_{t=1}^{n} \sup_{y_{1:t}} \big( \ell(f(y_{1:t-1}), y_t) - \ell(g(y_{1:t-1}), y_t) \big)^2 }. \tag{8} $$

To illustrate a gap in the two covering-number approaches, construct a particular class $\mathcal{F}$ as follows. For each
element $b \in \{0,1\}^n$, define $f_b$ by

$$ f_b(y_{1:t-1}) = \frac{1}{4}\, \mathbf{1}\{b_{1:t-1} = y_{1:t-1}\} + \frac{1}{4} $$

and take $\mathcal{F} = \{f_b : b \in \{0,1\}^n\}$. In other words, on round $t$, expert $f_b$ predicts probability 1/2 if history coincides
with $b_{1:t-1}$, and 1/4 otherwise. For two elements $f_b, f_{b'} \in \mathcal{F}$, let $\kappa(b, b') = \max\{t : b_{1:t} = b'_{1:t}\}$ be the last time the two
sequences agree (defined as 0 if $b_1 \ne b'_1$). Then

$$ \sum_{t=1}^{n} \sup_{y_{1:t}} \big( \ell(f_b(y_{1:t-1}), y_t) - \ell(f_{b'}(y_{1:t-1}), y_t) \big)^2 \ge \sum_{t=1}^{n} \big( \ell(f_b(b_{1:t-1}), 1) - \ell(f_{b'}(b_{1:t-1}), 1) \big)^2 \ge (n - \kappa(b, b')) \log(2)^2 $$

and thus there are at least $2^{n/2}$ functions at a constant distance $d(f, g) \ge c$.

In contrast, consider sequential covering in the sense of (7) (and Definition 1). Take any $y \in \{0,1\}^n$ and $f_b \in \mathcal{F}$.
The sequence of $n$ values $(f_b(\emptyset), f_b(y_1), \ldots, f_b(y_{1:t-1}), \ldots, f_b(y_{1:n-1}))$ is equal to 1/2 until $t = \kappa(b, y)$ and 1/4
afterwards. Let $V$ be a set of $n$ trees $v^1,\ldots,v^n$ labeled by $\{1/4, 1/2\}$. Each $v^i$ is defined as

$$ \forall y \in \{0,1\}^n,\; t \in \{1,\ldots,n\}, \quad v^i_t(y) = (1/4)\,\mathbf{1}\{t \le i - 1\} + 1/4. $$

It is immediate that this set of $n$ trees provides an exact cover of $\mathcal{F}$ (at scale 0) in the sense of Definition 1. This
leads to $O(\log(n)/n)$ bounds on minimax regret, while the discretization with respect to the supremum norm (8)
fails.
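The exact cover can be checked by brute force for small $n$. The sketch below enumerates every expert $f_b$ and every path $y$, and verifies that the realized sequence of predictions coincides with one of $n$ path-independent trees. The indexing of the cover (tree $i$ predicts 1/2 on rounds $t \le i$ and 1/4 afterwards) is an assumption chosen here so that the first round, where the history is empty, is always covered; the function names are illustrative.

```python
from itertools import product

def expert(b, history):
    """f_b: predict 1/2 while the history matches the prefix of b, and 1/4 otherwise."""
    return 0.5 if tuple(history) == tuple(b[: len(history)]) else 0.25

def cover_tree(i, t):
    """Path-independent tree: 1/2 on rounds t <= i, then 1/4 (indexing assumed here)."""
    return 0.5 if t <= i else 0.25

def is_exact_cover(n):
    """True if, for every expert and path, some cover tree matches at every round."""
    rounds = range(1, n + 1)
    for b in product((0, 1), repeat=n):
        for y in product((0, 1), repeat=n):
            values = [expert(b, y[: t - 1]) for t in rounds]
            matched = any(
                all(cover_tree(i, t) == values[t - 1] for t in rounds)
                for i in rounds
            )
            if not matched:
                return False
    return True

print(is_exact_cover(5))  # 32 experts are covered by 5 trees on every path
```

The point of the check is the contrast in sizes: $2^n$ experts, but only $n$ distinct prediction patterns along any path.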

The above failure is endemic to approaches that attempt to discretize the set of experts before the prediction
process has even started. In contrast, sequential complexities can be viewed as an analogue of “data-based” discretization,
which has been known in statistical learning since the work of Vapnik and Chervonenkis in the 60’s.


5 Example: Monotonically Nondecreasing Experts 

We consider an example of a nonparametric class analyzed in [4, p. 270]. Let $\mathcal{F}$ be a set of experts such that
the forecasted probability does not decrease in time. To model this scenario in a general manner, we suppose that
the side information $x_t = (t, x'_t) \in \mathbb{N} \times \mathcal{X}'_t(x'_{1:t-1}, y_{1:t-1})$ contains the time stamp, and $f(t+1, x') \ge f(t, x'')$ for any
$f \in \mathcal{F}$. The particular case of static experts, with prediction depending only on $t$ and no other side information,
has been considered in [4].

To invoke the results of the previous section, define a preorder on $(t, x') \in \mathcal{X} = \mathbb{N} \times \mathcal{X}'$ according to the time
stamp: $(t, u) \preceq (s, v)$ for any $t \le s$ and $u, v \in \mathcal{X}'$. Suppose an ordered $\mathcal{X}$-valued tree $x$ of depth $d$ is shattered,
according to Definition 2, with a witness tree $s$. We claim that the values of the witness tree must be increasing by
at least $\beta$ along the path $y = (1,1,1,\ldots)$. Indeed, consider any $t \ge 1$, and let $y' = (y_{1:t}, 0, y_{t+2:d})$. By the definition
of shattering, there must be a function that satisfies $f(x_t(y')) \ge s_t(y') + \beta/2$ and $f(x_{t+1}(y')) \le s_{t+1}(y') - \beta/2$. Since
$f(x_t(y')) \le f(x_{t+1}(y'))$, we conclude that $s_t(y) = s_t(y') \le s_{t+1}(y') - \beta = s_{t+1}(y) - \beta$. Hence, $s_t$ increases by at least $\beta$
along the path $(1,\ldots,1)$ and thus $d \le 1/\beta$. This quick calculation gives $\mathrm{fat}^o_\beta(\mathcal{F}, \mathcal{X}) \le 1/\beta$.

In view of Theorem 6,

$$ \log \mathcal{N}_\infty(\mathcal{F}, \beta, x) \le (1/\beta) \log(2en/\beta). $$

In view of (6), the same covering number estimate holds for $\mathcal{G}$. Then Theorem 4 with $\alpha = 1/n$ and $\gamma = n^{-a}$ (with $a$
to be determined later) implies that

$$ \int_{\alpha}^{\gamma} \log \mathcal{N}_\infty(\mathcal{G}, \rho, z)\, d\rho \le C \log^2 n $$

is a lower order term, with $C$ being an absolute constant. We also have

$$ \int_{\alpha}^{\gamma} \sqrt{\log \mathcal{N}_\infty(\mathcal{G}, \rho, z)}\, d\rho \le C' \sqrt{\log n} \cdot \gamma^{1/2}. $$

Now, ignoring constants and logarithmic terms, this gives the overall rate of

$$ \frac{1}{\gamma\delta} + \sqrt{\frac{n\gamma}{\delta}} $$

for the minimax regret with respect to $\mathcal{F}^\delta$. The terms are balanced by choosing $\gamma = n^{-1/3}\delta^{-1/3}$. The rate with
respect to $\mathcal{F}$ is then

$$ O^*\big(n\delta + n^{1/3}\delta^{-2/3}\big) = O^*\big(n^{3/5}\big) $$

by choosing $\delta = n^{-2/5}$. This corresponds to the rate obtained by [4].
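The two balancing steps can be spelled out as a short worked calculation. It assumes, as the covering number estimates above suggest, that the leading terms of the bound for the truncated class are of order $1/(\gamma\delta)$ and $\sqrt{n\gamma/\delta}$, with constants and logarithmic factors dropped.

```latex
% Balancing the leading terms (constants and logarithmic factors dropped).
\begin{align*}
\frac{1}{\gamma\delta} = \sqrt{\frac{n\gamma}{\delta}}
  &\iff \gamma^{3} = \frac{1}{n\delta}
  \iff \gamma = n^{-1/3}\delta^{-1/3}, \\
\left.\frac{1}{\gamma\delta}\right|_{\gamma = n^{-1/3}\delta^{-1/3}}
  &= n^{1/3}\delta^{-2/3}, \\
n\delta = n^{1/3}\delta^{-2/3}
  &\iff \delta^{5/3} = n^{-2/3}
  \iff \delta = n^{-2/5},
  \qquad\text{so that}\qquad n\delta = n^{3/5}.
\end{align*}
```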


6 Example: Linear Prediction 

In this section we consider the special case of $\mathcal{X}_1 = \ldots = \mathcal{X}_n = \mathcal{X} = B_2$ and

$$ \mathcal{F} = \{ f(x) = (\langle w, x \rangle + 1)/2 : w \in B_2 \} \tag{9} $$

where $B_2$ is a unit Euclidean (or Hilbert) ball. Written as a function of $w$, the loss at time $t$ is (up to an additive
constant $\log(2)$)

$$ g_t(w) = -\mathbf{1}\{y_t = 1\} \log(1 + \langle w, x_t \rangle) - \mathbf{1}\{y_t = 0\} \log(1 - \langle w, x_t \rangle). \tag{10} $$

It is possible to estimate the sequential $\mathrm{fat}_\beta$ dimension of a unit Hilbert ball as $\mathrm{fat}_\beta = O^*(1/\beta^2)$, where the $O^*$
notation ignores logarithmic factors. Then Theorem 4 gives an upper bound of

$$ V_n(\mathcal{F}^\delta) = O^*\big(n^{1/2}\delta^{-1}\big), $$

and thus

$$ V_n(\mathcal{F}) = O^*\big(n^{3/4}\big). $$

Below, we exhibit an algorithm that attains regret of $O^*(n^{1/2})$, implying that the upper bounds obtained with our
technique are not always tight.

6.1 Algorithm: Regularization with Self-Concordant Barrier 

To develop an algorithm for the problem, we turn to the field of online convex optimization. We observe that
functions $g_t$ defined in (10) are convex, but not strongly convex. Moreover, the gradients of $g_t(w)$ are not bounded.
We may consider a restricted set to mitigate the exploding gradient; however, a $\delta$-shrinkage of the ball $B_2$ still leaves
the gradient to be of size $O(1/\delta)$. A direct gradient descent method will give the suboptimal $O(n^{3/4})$ upper bound
derived above in a non-constructive way. We also mention that while the functions are exp-concave, the upper
bounds for the Online Newton Step method [6] scale with the dimension of the space, which we assume to be large
or infinite.
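The following sketch implements the loss (10) and its gradient for finite-dimensional vectors stored as Python lists, checks the gradient against central finite differences, and illustrates the $1/\delta$ blow-up on a $\delta$-shrunk ball. The function names and test vectors are assumptions made for illustration.

```python
import math

def loss(w, x, y):
    """g_t(w) from (10) for side information x and outcome y in {0, 1}."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return -math.log(1.0 + s) if y == 1 else -math.log(1.0 - s)

def grad(w, x, y):
    """Gradient of g_t with respect to w."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    scale = -1.0 / (1.0 + s) if y == 1 else 1.0 / (1.0 - s)
    return [scale * xi for xi in x]

def numeric_grad(w, x, y, h=1e-6):
    """Central finite-difference approximation of the gradient."""
    out = []
    for i in range(len(w)):
        up = list(w); up[i] += h
        down = list(w); down[i] -= h
        out.append((loss(up, x, y) - loss(down, x, y)) / (2.0 * h))
    return out

# Gradient size on the boundary of the delta-shrunk ball, in the worst direction.
delta = 0.01
w_bad = [-(1.0 - delta), 0.0]
g_bad = grad(w_bad, [1.0, 0.0], 1)
print(math.sqrt(sum(c * c for c in g_bad)))  # approximately 1 / delta
```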

We now present an algorithm based on self-concordant barrier regularization, which appears to be of independent
interest. The algorithm answers the following question: can one obtain regret bounds for online convex
optimization in terms of the maximum of function values rather than gradients?

Consider the Follow-the-Regularized-Leader method

$$ w_{t+1} = \operatorname*{argmin}_{w \in B_2} \; \sum_{s=1}^{t} \langle \nabla g_s(w_s), w \rangle + \eta^{-1} R(w) \tag{11} $$

with the self-concordant barrier $R(w) = -\log(1 - \|w\|^2)$. In accordance with the protocol of the probability assignment
problem, we predict $\langle w_t, x_t \rangle$ at round $t$ after observing $x_t$. It is shown in [2] that regret of (11) against any
$w^* \in B_2$ is

$$ \sum_{t=1}^{n} g_t(w_t) - \sum_{t=1}^{n} g_t(w^*) \le 2\eta \sum_{t=1}^{n} \big( \|\nabla g_t(w_t)\|^*_{w_t} \big)^2 + \eta^{-1} R(w^*) \tag{12} $$

as long as $\eta$ satisfies $\eta \|\nabla g_t(w_t)\|^*_{w_t} \le 1/4$. Here, the local norm is defined as

$$ \|h\|^*_{w} = \sqrt{ h^{\top} \big( \nabla^2 R(w) \big)^{-1} h }. $$

According to the lemma below, the local norm is bounded by a constant that is independent of the dimension:

Lemma 7. For any $t$, the local norm of $\nabla g_t(w_t)$ is upper bounded by a constant:

$$ \|\nabla g_t(w_t)\|^*_{w_t} \le 3. $$
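A numerical spot check of the lemma is sketched below. It assumes the usual dual local norm $\sqrt{h^\top (\nabla^2 R(w))^{-1} h}$ for the barrier $R(w) = -\log(1 - \|w\|^2)$ and inverts the Hessian $\tfrac{2}{u} I + \tfrac{4}{u^2} w w^\top$, with $u = 1 - \|w\|^2$, in closed form by the Sherman-Morrison formula. The sampling scheme and names are illustrative, and a random search is of course not a proof.

```python
import math
import random

def grad(w, x, y):
    """Gradient of g_t(w) in (10) for one round."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    scale = -1.0 / (1.0 + s) if y == 1 else 1.0 / (1.0 - s)
    return [scale * xi for xi in x]

def local_norm(h, w):
    """sqrt(h^T [Hess R(w)]^{-1} h) for R(w) = -log(1 - |w|^2)."""
    r2 = sum(wi * wi for wi in w)
    hh = sum(hi * hi for hi in h)
    wh = sum(wi * hi for wi, hi in zip(w, h))
    # Inverse Hessian: (u / 2) * (I - 2 w w^T / (1 + |w|^2)) with u = 1 - |w|^2.
    value = 0.5 * (1.0 - r2) * (hh - 2.0 * wh * wh / (1.0 + r2))
    return math.sqrt(max(value, 0.0))

def random_in_ball(dim, max_radius, rng):
    """Random direction with a radius drawn uniformly from [0, max_radius]."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    radius = max_radius * rng.random()
    return [radius * c / norm for c in v]

rng = random.Random(0)
worst = 0.0
for _ in range(20000):
    w = random_in_ball(3, 0.999, rng)
    x = random_in_ball(3, 1.0, rng)
    worst = max(worst, local_norm(grad(w, x, rng.randint(0, 1)), w))
# Adversarial direction: x points against w, where the Euclidean gradient explodes.
for r in (0.9, 0.99, 0.999):
    w = [r, 0.0, 0.0]
    worst = max(worst, local_norm(grad(w, [-1.0, 0.0, 0.0], 1), w))
print(worst)
```

In the adversarial direction the Euclidean norm of the gradient is $1/(1-r)$, yet the local norm stays bounded because the inverse Hessian shrinks in exactly that direction.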

Together with (12), Lemma 7 implies a regret bound of $18\eta n + \eta^{-1} R(w^*)$. Instead of taking $w^*$ at the boundary
of the ball where $R(w^*)$ is infinite, we can evaluate regret against $\tilde{w} = (1 - 1/n) w^*$. For such a comparator, $R(\tilde{w}) =
O(\log n)$. By choosing $\eta$ appropriately and using an argument similar to (2), we conclude that regret against any
$w^* \in B_2$ is upper bounded by

$$ C\sqrt{n \log n}. $$
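A compact sketch of the method (11) in finite dimension follows. The closed-form update is a derivation made here and is not taken from the paper: by rotational symmetry the minimizer of $\langle G, w\rangle + \eta^{-1}R(w)$ is $-s\,G/\|G\|$, and setting the derivative in $s$ to zero gives $s = a/(1 + \sqrt{1 + a^2})$ with $a = \eta\|G\|$. The step size $\eta = 1/12$ is chosen to satisfy $\eta \cdot 3 \le 1/4$; the data are synthetic and all names are illustrative.

```python
import math

def loss(w, x, y):
    """g_t(w) from (10)."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return -math.log(1.0 + s) if y == 1 else -math.log(1.0 - s)

def grad(w, x, y):
    """Gradient of g_t(w)."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    scale = -1.0 / (1.0 + s) if y == 1 else 1.0 / (1.0 - s)
    return [scale * xi for xi in x]

def ftrl_step(grad_sum, eta):
    """Minimizer of <G, w> + R(w) / eta over the open unit ball, R(w) = -log(1 - |w|^2)."""
    a = eta * math.sqrt(sum(g * g for g in grad_sum))
    # w = -s * G / |G| with s = a / (1 + sqrt(1 + a^2)), which is always below 1.
    return [-eta * g / (1.0 + math.sqrt(1.0 + a * a)) for g in grad_sum]

def run_ftrl(xs, ys, eta):
    """Play (11) on a sequence; return the final iterate and the cumulative loss."""
    dim = len(xs[0])
    w, grad_sum, total = [0.0] * dim, [0.0] * dim, 0.0
    for x, y in zip(xs, ys):
        total += loss(w, x, y)   # the forecaster predicts (<w, x> + 1) / 2
        g = grad(w, x, y)
        grad_sum = [a + b for a, b in zip(grad_sum, g)]
        w = ftrl_step(grad_sum, eta)
    return w, total

# Synthetic data (assumed): unit-norm side information, repeating outcomes.
xs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]] * 20
ys = [1, 0, 1] * 20
w_final, total_loss = run_ftrl(xs, ys, eta=1.0 / 12.0)
print(w_final, total_loss)
```

Because the barrier diverges at the boundary, every iterate stays strictly inside the ball, so the loss is always finite without any explicit projection.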

Importantly, $C$ is an absolute constant that does not depend on the dimension of the problem. This rate is optimal
up to polylogarithmic factors. The optimality follows from Lemma 10 below and an estimate on sequential
covering number of a Hilbert ball [11, 12].

Lemma 8. For the linear class in (9),

$$ V_n(\mathcal{F}) = \Theta^*\big(n^{1/2}\big). $$

The proof of Lemma 7 relied heavily on the ability to calculate the gradient of the loss function and match
it to the inverse Hessian of the self-concordant barrier. We now give an alternative proof based on a simple and
charming, yet unexpected lemma due to Nesterov (see Appendix for the short proof):

Lemma 9 (Lemma 4 in [8]). Let $\psi$ be concave and positive on $\operatorname{int} \mathcal{K}$. Then for any $x \in \operatorname{int} \mathcal{K}$ we have

$$ \|\nabla \psi(x)\|^*_{x} \le \psi(x). $$

The lemma allows us to upper bound regret in an online convex optimization problem if we only know that
the values of the functions (and not the gradients) are bounded. Consider the FTRL algorithm (11), but over the
shrunk ball $(1 - 1/n) B_2$. Suppose we can ensure $0 \le g_t \le A$. Then $A - g_t$ is concave and positive. Hence, by the above
lemma

$$ \|\nabla g_t(w_t)\|^*_{w_t} = \|\nabla (A - g_t(w_t))\|^*_{w_t} \le A - g_t(w_t) \le A $$

which provides an alternative to the bound of Lemma 7. Regret is then upper bounded by

$$ \sum_{t=1}^{n} g_t(w_t) - \sum_{t=1}^{n} g_t(w^*) \le 2\eta n A^2 + \eta^{-1} R(w^*). $$

Crucially, by employing self-concordant regularization, we avoid paying for a large gradient of cost functions at the
boundary of the set. Over the shrunk set $(1 - 1/n) B_2$, we ensure that the values of functions $g_t$ are upper bounded
by $A = O(\log n)$ even if the gradients blow up linearly with $n$. This surprising observation leads to a
dimension-independent $O(\sqrt{n}\log n)$ regret bound for the Euclidean ball, and can also be used for other convex bodies and
non-logarithmic loss functions when the closed-form analysis of Lemma 7 is not available.
7 A Lower Bound


In this section, we show that the offset sequential Rademacher complexity serves as a lower bound on the minimax
regret. Hence, the complexities of the class $\mathcal{F}$ of experts are intrinsic to the problem. We refer to [12] for further
lower bounds on the offset Rademacher complexity via the scale-sensitive dimension and sequential covering
numbers.

Lemma 10. The following lower bound holds:

$$ V_n(\mathcal{F}) + 1 \ge \sup_{x} \mathbb{E}_{y} \sup_{f \in \mathcal{F}^{1/n}} \left\{ \sum_{t=1}^{n} 2(2y_t - 1)\big(f(x_t(y)) - 1/2\big) - 4(\log n)\big(f(x_t(y)) - 1/2\big)^2 \right\} $$

where $y_1,\ldots,y_n$ are independent with distribution $\mathrm{Bernoulli}(1/2)$ and the supremum is taken over consistent trees
with respect to constraints $\mathcal{X}_t$.

Proof of Lemma 10. To prove the lower bound, we proceed as in [12]. First, we observe that (2) holds in the other
direction too:

$$ V_n(\mathcal{F}^\delta) \le V_n(\mathcal{F}) + n\delta. \tag{13} $$

To see this, note that any $f$ only loses from thresholding when either $f(x_t) > 1 - \delta$ and $y_t = 1$, or when $f(x_t) < \delta$
and $y_t = 0$. In both cases, the difference in logarithmic loss is at most $-\log(1 - \delta) \le \delta$ for $\delta < 1/2$. For the purposes
of a lower bound, we take $\delta = 1/n$ and turn to lower-bounding $V_n(\mathcal{F}^{1/n})$.

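As in the development leading to (25) in the proof of the upper bound, the minimax value $V_n(\mathcal{F}^{1/n})$ is equal to

$$ \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t \in [0,1]} \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n} \sup_{f \in \mathcal{F}^{1/n}} \left\{ \sum_{t=1}^{n} \inf_{\hat{y}_t \in [0,1]} \mathbb{E}_{y_t}[\ell(\hat{y}_t, y_t)] - \sum_{t=1}^{n} \ell(f(x_t), y_t) \right\} \tag{14} $$

which, by the self-information property of the loss, is equal to

$$ \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t \in [0,1]} \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n} \sup_{f \in \mathcal{F}^{1/n}} \left\{ \sum_{t=1}^{n} \mathbb{E}_{y_t}[\ell(p_t, y_t)] - \sum_{t=1}^{n} \ell(f(x_t), y_t) \right\}. \tag{15} $$

By the linearity of expectation (and since the terms $\mathbb{E}_{y_t}[\ell(p_t, y_t)]$ do not involve $f$), we have

$$ = \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t \in [0,1]} \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n} \sup_{f \in \mathcal{F}^{1/n}} \left\{ \sum_{t=1}^{n} \ell(p_t, y_t) - \sum_{t=1}^{n} \ell(f(x_t), y_t) \right\}. \tag{16} $$

We now pass to the first lower bound by choosing $p_t = 1/2$ for all $t$.

Consider the case $y_t = 1$ and expand the loss function around $p_t = 1/2$ for $z \in [1/n, 1]$:

$$ \ell(1/2, 1) - \ell(z, 1) = -\log(1/2) - (-\log(z)) = 2(z - 1/2) - R(z) \tag{17} $$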

where $R(z)$ is the remainder. We claim that the remainder can be upper bounded by a quadratic over the interval
$[1/n, 1]$. To this end, consider the function

$$ g(z) = -2z + (1 + \log(2)) + 4(\log n)(z - 1/2)^2 $$

and note that the derivative and the value of this function at 1/2 coincide with the derivative and the value of
$-\log(z)$ at the same point. We claim that $g(z)$ dominates $-\log(z)$ on $[1/n, 1]$. For $z \ge 1/2$, this follows from
$g' \ge (-\log)'$. The same argument holds for the interval $]1/\log(n), 1/2]$. Now, at $z = 1/n$, $g(z) \ge -\log(z)$ and
$|g'(z)| \le |\log(z)'|$. The derivative relation continues to hold on the interval $[1/n, c/\log(n)]$ for large enough $c$,
establishing $g \ge -\log$ on this interval too. The remaining interval $]c/\log(n), 1/\log(n)[$ is easily checked by the direct
computation of function value. In sum, the remainder in (17) can be upper bounded by $R(z) \le 4(\log n)(z - 1/2)^2$.
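The domination claim is easy to probe numerically. The sketch below evaluates the gap $g(z) + \log z$ on a grid over $[1/n, 1]$ for a few values of $n$ and reports the minimum, which should be nonnegative up to rounding (the gap vanishes at $z = 1/2$). The grid size and the names `quadratic_majorant` and `min_gap` are illustrative choices.

```python
import math

def quadratic_majorant(z, n):
    """g(z) = -2z + (1 + log 2) + 4 (log n) (z - 1/2)^2."""
    return -2.0 * z + (1.0 + math.log(2.0)) + 4.0 * math.log(n) * (z - 0.5) ** 2

def min_gap(n, grid=2000):
    """Minimum of g(z) - (-log z) over a uniform grid on [1/n, 1]."""
    lo = 1.0 / n
    points = (lo + (1.0 - lo) * k / grid for k in range(grid + 1))
    return min(quadratic_majorant(z, n) + math.log(z) for z in points)

for n in (2, 3, 10, 1000):
    print(n, min_gap(n))
```

Among these values the left endpoint is tightest for $n = 3$, where the gap at $z = 1/3$ is about 0.05.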
The case of $y_t = 0$ is exactly analogous, and we obtain, for each $t$,

$$ \begin{aligned} \ell(1/2, y_t) - \ell(f(x_t), y_t) &\ge 2\big[ \mathbf{1}\{y_t = 1\}(f(x_t) - 1/2) + \mathbf{1}\{y_t = 0\}(-f(x_t) + 1/2) \big] - 4(\log n)(f(x_t) - 1/2)^2 && (18) \\ &= 2\big[ y_t (f(x_t) - 1/2) + (1 - y_t)(-f(x_t) + 1/2) \big] - 4(\log n)(f(x_t) - 1/2)^2 && (19) \\ &= 2(2y_t - 1)(f(x_t) - 1/2) - 4(\log n)(f(x_t) - 1/2)^2. && (20) \end{aligned} $$

The lower bound in (21) then becomes 


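$$ \begin{aligned} V_n(\mathcal{F}^{1/n}) &\ge \Big\langle\!\!\Big\langle \sup_{x_t} \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n} \sup_{f \in \mathcal{F}^{1/n}} \left\{ \sum_{t=1}^{n} 2(2y_t - 1)(f(x_t) - 1/2) - 4(\log n)(f(x_t) - 1/2)^2 \right\} && (21) \\ &= \sup_{x} \mathbb{E}_{y} \sup_{f \in \mathcal{F}^{1/n}} \left\{ \sum_{t=1}^{n} 2(2y_t - 1)(f(x_t(y)) - 1/2) - 4(\log n)(f(x_t(y)) - 1/2)^2 \right\} && (22) \end{aligned} $$

where $y_1,\ldots,y_n$ are independent with distribution $\mathrm{Bernoulli}(1/2)$.

□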


8 Discussion and Open Questions 

At the very first step, the analysis in this paper thresholds the class $\mathcal{F}$ to avoid dealing with the exploding gradient
of the loss function. The authors believe that this “hard thresholding” approach is the source of sub-optimality, and
that “smooth” approaches should be possible. When the class of functions has a specific structure, such as in the
example of Section 6, the exploding gradient can be mitigated in a “smooth way” by a regularization technique. It is
not clear to the authors how to perform the “smooth thresholding” analysis when such a structure is not available.

Another interesting avenue of investigation is the development of algorithms. It has been shown that the minimax
analysis of the type performed in this paper can be made constructive [9, 12, 10]. It appears that the relaxation
approach may yield new (and possibly computationally efficient) methods for sequential probability assignment
and data compression.


A Proofs 


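Proof of Theorem 1. Let us use the shorthand $\mathcal{D} = [\delta, 1-\delta]$. The value $V_n(\mathcal{F}^\delta)$ can be upper bounded by

$$ \Big\langle\!\!\Big\langle \sup_{x_t} \inf_{\hat{y}_t \in \mathcal{D}} \sup_{p_t \in [0,1]} \mathbb{E}_{y_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n} \left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}^\delta} \sum_{t=1}^{n} \ell(f(x_t), y_t) \right] \tag{23} $$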


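simply because each infimum is taken over a smaller set. Henceforth, it will be understood that $x_t$ ranges over
$\mathcal{X}_t(x_{1:t-1}, y_{1:t-1})$. The expression in (23) is equal to

$$ \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t \in [0,1]} \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n} \left[ \sum_{t=1}^{n} \inf_{\hat{y}_t \in \mathcal{D}} \mathbb{E}_{y_t}[\ell(\hat{y}_t, y_t)] - \inf_{f \in \mathcal{F}^\delta} \sum_{t=1}^{n} \ell(f(x_t), y_t) \right] \tag{24} $$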


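by an argument that can be found in [1, 13]. Here, it is understood that $y_t$ is a Bernoulli random variable with
distribution $p_t$. Taking the infimum outside the negative sign, the above quantity is equal to

$$ \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t \in [0,1]} \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n} \sup_{f \in \mathcal{F}^\delta} \left\{ \sum_{t=1}^{n} \inf_{\hat{y}_t \in \mathcal{D}} \mathbb{E}_{y_t}[\ell(\hat{y}_t, y_t)] - \sum_{t=1}^{n} \ell(f(x_t), y_t) \right\}. \tag{25} $$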


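We now claim that each infimum in (25) is achieved at $\hat{y}_t = \tau_\delta(p_t)$. Indeed, the unconstrained minimizer over $[0,1]$ is $p_t$ by the well-known property of entropy:

$$
\operatorname*{argmin}_{\hat{y}_t \in [0,1]} \mathbb{E}_{y_t}\big[\ell(\hat{y}_t, y_t)\big] = \operatorname*{argmin}_{\hat{y}_t \in [0,1]} \big\{ -p_t \log(\hat{y}_t) - (1-p_t)\log(1-\hat{y}_t) \big\} = p_t ,
$$

and, since the objective is convex in $\hat{y}_t$, its minimizer over the interval $\mathcal{D}$ is the point of $\mathcal{D}$ closest to $p_t$, namely $\tau_\delta(p_t)$. We conclude that (25) is equal to

$$
\Big\langle\!\!\Big\langle \sup_{x_t}\, \sup_{p_t \in [0,1]}\, \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n}\, \sup_{f \in \mathcal{F}^\delta} \left\{ \sum_{t=1}^{n} \mathbb{E}_{y_t}\big[\ell(\tau_\delta(p_t), y_t)\big] - \sum_{t=1}^{n} \ell(f(x_t), y_t) \right\} \tag{26}
$$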
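The claim that the constrained minimizer of the expected logarithmic loss is the thresholded value of $p_t$ can be checked numerically. The following is a minimal sketch under stated assumptions: the helper names (`expected_log_loss`, `clip`, `grid_argmin`) are hypothetical, `clip` is assumed to play the role of $\tau_\delta$, and the brute-force grid search is only an illustration, not part of the proof.

```python
import math


def expected_log_loss(y_hat: float, p: float) -> float:
    """Expected log loss of predicting y_hat when y ~ Bernoulli(p)."""
    return -p * math.log(y_hat) - (1.0 - p) * math.log(1.0 - y_hat)


def clip(p: float, delta: float) -> float:
    """Assumed form of the thresholding map: project p onto [delta, 1 - delta]."""
    return min(max(p, delta), 1.0 - delta)


def grid_argmin(p: float, lo: float, hi: float, steps: int = 20000) -> float:
    """Brute-force minimizer of the expected loss over a uniform grid on [lo, hi]."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return min(grid, key=lambda y_hat: expected_log_loss(y_hat, p))


if __name__ == "__main__":
    delta = 0.1
    for p in (0.0, 0.03, 0.4, 0.95, 1.0):
        # The grid minimizer over [delta, 1 - delta] should sit next to clip(p, delta).
        print(p, round(grid_argmin(p, delta, 1.0 - delta), 4), clip(p, delta))
```

For $p_t$ inside the interval the two columns agree up to the grid resolution, and for $p_t$ outside it the minimizer sits on the nearest endpoint.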


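Now, the terms in the first sum do not depend on $f \in \mathcal{F}^\delta$, and thus can pass through the multiple infima and suprema. By linearity of expectation, (26) is equal to

$$
\Big\langle\!\!\Big\langle \sup_{x_t}\, \sup_{p_t \in [0,1]}\, \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n}\, \sup_{f \in \mathcal{F}^\delta} \left\{ \sum_{t=1}^{n} \ell(\tau_\delta(p_t), y_t) - \sum_{t=1}^{n} \ell(f(x_t), y_t) \right\} \tag{27}
$$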


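We now separately deal with the case that $p_t \notin \mathcal{D}$. To this end, observe that

$$
\begin{aligned}
\mathbf{1}\{p_t < \delta\}\big(\ell(\tau_\delta(p_t), y_t) - \ell(f(x_t), y_t)\big)
&= \mathbf{1}\{p_t < \delta,\, y_t = 0\}\big(\ell(\delta, y_t) - \ell(f(x_t), y_t)\big) \\
&\quad + \mathbf{1}\{p_t < \delta,\, y_t = 1\}\big(\ell(\delta, y_t) - \ell(f(x_t), y_t)\big) \\
&\le \mathbf{1}\{p_t < \delta,\, y_t = 1\}\big(\ell(\delta, 1) - \ell(f(x_t), 1)\big) \\
&\le -\mathbf{1}\{p_t < \delta,\, y_t = 1\}\log\delta .
\end{aligned}
$$

The first inequality is obtained by dropping the non-positive term. Indeed, when $p_t < \delta$ the thresholded prediction $\delta$ gives at least as high odds to the outcome $y_t = 0$ as $f(x_t) \ge \delta$ does. Non-negativity of $\ell$ gives the second inequality. A similar calculation gives

$$
\mathbf{1}\{p_t > 1-\delta\}\big(\ell(\tau_\delta(p_t), y_t) - \ell(f(x_t), y_t)\big) \le -\mathbf{1}\{p_t > 1-\delta,\, y_t = 0\}\log\delta .
$$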


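Substituting into (27), we obtain an upper bound of

$$
\Big\langle\!\!\Big\langle \sup_{x_t}\, \sup_{p_t \in [0,1]}\, \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n}\, \sup_{f \in \mathcal{F}^\delta} \left\{ \sum_{t=1}^{n} \mathbf{1}\{p_t \in \mathcal{D}\}\big(\ell(\tau_\delta(p_t), y_t) - \ell(f(x_t), y_t)\big) - \mathbf{1}\{p_t < \delta,\, y_t = 1\}\log\delta - \mathbf{1}\{p_t > 1-\delta,\, y_t = 0\}\log\delta \right\} .
$$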


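Since

$$
\mathbb{E}_{y_t \sim p_t}\, \mathbf{1}\{p_t < \delta,\, y_t = 1\}\log(1/\delta) \le \delta\log(1/\delta),
$$

and since $\mathbf{1}\{p_t \in \mathcal{D}\}\,\ell(\tau_\delta(p_t), y_t) = \mathbf{1}\{p_t \in \mathcal{D}\}\,\ell(p_t, y_t)$, we conclude that the minimax value $\mathcal{V}_n(\mathcal{F}^\delta)$ is upper bounded by

$$
\Big\langle\!\!\Big\langle \sup_{x_t}\, \sup_{p_t \in [0,1]}\, \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n}\, \sup_{f \in \mathcal{F}^\delta} \left\{ \sum_{t=1}^{n} \mathbf{1}\{p_t \in \mathcal{D}\}\big(\ell(p_t, y_t) - \ell(f(x_t), y_t)\big) \right\} + 2n\delta\log(1/\delta). \tag{28}
$$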
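To get a feel for the additive term $2n\delta\log(1/\delta)$ in (28), it can simply be evaluated for a few thresholds. The sketch below is illustrative only: the function name `threshold_penalty` is hypothetical, and the choice $\delta = 1/n$ is an assumption made here for the example, not a tuning prescribed by the analysis.

```python
import math


def threshold_penalty(n: int, delta: float) -> float:
    """Additive cost 2 * n * delta * log(1 / delta) paid for thresholding."""
    return 2.0 * n * delta * math.log(1.0 / delta)


if __name__ == "__main__":
    n = 1000
    for delta in (1.0 / n, 1.0 / math.sqrt(n), 0.1):
        print(f"delta={delta:.4f}  penalty={threshold_penalty(n, delta):.2f}")
```

With the illustrative choice $\delta = 1/n$ the term equals $2\log n$, so it grows only logarithmically in $n$, while a constant threshold makes it linear in $n$.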


We now linearize the terms $\ell(p_t, y_t) - \ell(f(x_t), y_t)$. The derivative of $\ell(\cdot, y_t)$ at $p_t$ is

$$
\ell'(p_t, y_t) = -\mathbf{1}\{y_t = 1\}\frac{1}{p_t} + \mathbf{1}\{y_t = 0\}\frac{1}{1-p_t} .
$$

Observe that the second derivative $\ell''(\cdot, y_t) \ge 1$, and hence $\ell(\cdot, y_t)$ is strongly convex, for either value of $y_t$. Strong convexity implies that

$$
\ell(p_t, y_t) - \ell(f(x_t), y_t) \le \ell'(p_t, y_t)\cdot(p_t - f(x_t)) - \frac{1}{2}(p_t - f(x_t))^2
$$

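and thus (28) is upper bounded by

$$
\Big\langle\!\!\Big\langle \sup_{x_t}\, \sup_{p_t \in [0,1]}\, \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n}\, \sup_{f \in \mathcal{F}^\delta} \left\{ \sum_{t:\, p_t \in \mathcal{D}} \ell'(p_t, y_t)\cdot(p_t - f(x_t)) - \frac{1}{2}(p_t - f(x_t))^2 \right\} + 2n\delta\log(1/\delta). \tag{29}
$$

Observe that the derivatives are mean-zero:

$$
\mathbb{E}_{y_t \sim p_t}\, \ell'(p_t, y_t) = \mathbb{E}\left[ -\mathbf{1}\{y_t = 1\}\frac{1}{p_t} + \mathbf{1}\{y_t = 0\}\frac{1}{1-p_t} \right] = 0, \tag{30}
$$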


which suggests that we can symmetrize these terms as in [11, 12]. The key observation is that tighter control on the supremum over $\mathcal{F}$ will be obtained if we keep the derivatives to have a non-uniform distribution given by $p_t$.
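The two facts used in the linearization step, namely the strong-convexity inequality and the mean-zero property of the derivative under $y_t \sim p_t$, are easy to sanity-check numerically. The following is a small sketch; the function names are hypothetical and the grid of test points is an arbitrary choice.

```python
import math


def loss(q: float, y: int) -> float:
    """Logarithmic loss of predicting probability q for outcome y in {0, 1}."""
    return -math.log(q) if y == 1 else -math.log(1.0 - q)


def dloss(q: float, y: int) -> float:
    """Derivative of the logarithmic loss in its first argument."""
    return -1.0 / q if y == 1 else 1.0 / (1.0 - q)


def linearization_slack(p: float, f: float, y: int) -> float:
    """Right-hand side minus left-hand side of the strong-convexity bound."""
    rhs = dloss(p, y) * (p - f) - 0.5 * (p - f) ** 2
    lhs = loss(p, y) - loss(f, y)
    return rhs - lhs


def mean_derivative(p: float) -> float:
    """Expectation of the derivative at p when y ~ Bernoulli(p)."""
    return p * dloss(p, 1) + (1.0 - p) * dloss(p, 0)


if __name__ == "__main__":
    grid = [i / 100 for i in range(1, 100)]
    worst = min(linearization_slack(p, f, y) for p in grid for f in grid for y in (0, 1))
    print("smallest slack:", worst)
    print("largest |mean derivative|:", max(abs(mean_derivative(p)) for p in grid))
```

A non-negative smallest slack is consistent with the inequality, and the mean derivative vanishes up to floating-point error.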

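Let us drop the term $2n\delta\log(1/\delta)$ in (29) and concentrate on the first term. Consider the following upper bound:

$$
\begin{aligned}
&\Big\langle\!\!\Big\langle \sup_{x_t}\, \sup_{p_t \in [0,1]}\, \mathbb{E}_{y_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n}\, \sup_{f \in \mathcal{F}^\delta} \left\{ \sum_{t:\, p_t \in \mathcal{D}} \ell'(p_t, y_t)\cdot(p_t - f(x_t)) - \frac{1}{2}(p_t - f(x_t))^2 \right\} \\
&\qquad \le \Big\langle\!\!\Big\langle \sup_{x_t}\, \sup_{p_t, p'_t \in [0,1]}\, \mathbb{E}_{y_t \sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n}\, \sup_{f \in \mathcal{F}^\delta} \left\{ \sum_{t:\, p_t \in \mathcal{D}} \ell'(p_t, y_t)\cdot(p'_t - f(x_t)) - \frac{1}{2}(p'_t - f(x_t))^2 \right\} .
\end{aligned} \tag{31}
$$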


This upper bound holds because the supremum allows the choice $p'_t = p_t$ in addition to distinct choices for the two distributions.

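We now pass to the tree notation. Observe that the optimal choice of $x_t, p_t, p'_t$ depends on $(y_1, \ldots, y_{t-1}) \in \{0,1\}^{t-1}$. In the functional form, let $\mathbf{x}$ be a sequence of mappings $\mathbf{x}_1, \ldots, \mathbf{x}_n$ with the consistency property $\mathbf{x}_t(y_1, \ldots, y_{t-1}) \in \mathcal{X}_t(\mathbf{x}_1(y), \ldots, \mathbf{x}_{t-1}(y), y_{1:t-1})$ for all $y_{1:t-1}$. Similarly, let $\mathbf{p}$ and $\boldsymbol{\mu}$ be sequences of mappings with $p_t, \mu_t : \{0,1\}^{t-1} \to [0,1]$. Writing $\eta(p, y) = \ell'(p, y)$ for the derivative, and with the same reasoning as in [13], we can write (31) as

$$
\sup_{\mathbf{x}, \boldsymbol{\mu}, \mathbf{p}}\, \mathbb{E}\, \sup_{f \in \mathcal{F}^\delta} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\big(\mu_t(y) - f(\mathbf{x}_t(y))\big) - \frac{1}{2}\big(\mu_t(y) - f(\mathbf{x}_t(y))\big)^2 \right]
$$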


where the $y_t$'s in $\{0,1\}$ are drawn from $\mathbf{p}$. More specifically, $y_1 \sim p_1$ and subsequently $y_t \sim p_t(y_{1:t-1})$. □

Proof of Lemma 2. We have


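$$
\begin{aligned}
\mathbb{E}\sup_{v \in V} \left[ \sum_{t=1}^{n} \eta(p_t(y), y_t)\, v_t(y) - c\, v_t(y)^2 \right]
&= \mathbb{E}\inf_{\lambda > 0} \frac{1}{\lambda} \log\left( \sum_{v \in V} \exp\left\{ \lambda \sum_{t=1}^{n} \big( \eta(p_t(y), y_t)\, v_t(y) - c\, v_t(y)^2 \big) \right\} \right) \\
&\le \inf_{\lambda > 0} \frac{1}{\lambda} \log\left( \sum_{v \in V} \mathbb{E} \prod_{t=1}^{n} \exp\Big( \lambda \big( \eta(p_t(y), y_t)\, v_t(y) - c\, v_t(y)^2 \big) \Big) \right) .
\end{aligned} \tag{32}
$$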


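Let $X$ be a zero-mean random variable taking on the value $-v/p$ with probability $p$ and $v/(1-p)$ with probability $(1-p)$, where $\delta \le p \le 1/2$ and $|v| \le 1$. From the fact that $(e^x - x - 1)/x^2$ is a non-decreasing function and $|X| \le 1/\delta$ almost surely, it follows that

$$
e^{\lambda X} - \lambda X - 1 \le \delta^2 X^2 \left( e^{\lambda/\delta} - \lambda/\delta - 1 \right).
$$

Taking expectation over $X$ and upper bounding the variance $\mathbb{E}X^2 \le 2p(v/p)^2 \le 2v^2/\delta$,

$$
\mathbb{E}e^{\lambda X} - 1 \le 2v^2\delta \left( e^{\lambda/\delta} - \lambda/\delta - 1 \right).
$$

Using $1 + x \le e^x$,

$$
\mathbb{E}e^{\lambda X} \le \exp\left\{ 2v^2\delta \left( e^{\lambda/\delta} - \lambda/\delta - 1 \right) \right\}.
$$

Applying the above derivation,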


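$$
\begin{aligned}
\mathbb{E}\Big[ \exp\big( \lambda ( \eta(p_t(y), y_t)\, v_t(y) - c\, v_t(y)^2 ) \big) \,\Big|\, y_1, \ldots, y_{t-1} \Big]
&= \exp\big( -\lambda c\, v_t(y)^2 \big) \times \mathbb{E}\Big[ \exp\big( \lambda\, \eta(p_t(y), y_t)\, v_t(y) \big) \,\Big|\, y_1, \ldots, y_{t-1} \Big] \\
&\le \exp\big( -\lambda c\, v_t(y)^2 \big) \times \exp\left\{ 2 v_t(y)^2 \delta \left( e^{\lambda/\delta} - \lambda/\delta - 1 \right) \right\} \\
&= \exp\left\{ 2\delta\, v_t(y)^2 \left( e^{\lambda/\delta} - 1 - \frac{(2+c)\lambda}{2\delta} \right) \right\} .
\end{aligned}
$$

Choosing $\lambda = \delta\log\left(1 + \frac{c}{2}\right)$ we ensure that $e^{\lambda/\delta} - 1 - \frac{(2+c)\lambda}{2\delta} \le 0$ and

$$
\mathbb{E}\Big[ \exp\big( \lambda ( \eta(p_t(y), y_t)\, v_t(y) - c\, v_t(y)^2 ) \big) \,\Big|\, y_1, \ldots, y_{t-1} \Big] \le 1 .
$$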
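The Bernstein-type moment generating function bound, and the fact that $\lambda = \delta\log(1 + c/2)$ makes each conditional factor at most one, can be checked numerically for the two-point variable $X$. This is a sketch under the stated reading of the argument; all function names are hypothetical and the parameter grid is arbitrary.

```python
import math


def phi(x: float) -> float:
    """phi(x) = e^x - x - 1."""
    return math.exp(x) - x - 1.0


def mgf(lam: float, v: float, p: float) -> float:
    """E exp(lam * X) for X = -v/p w.p. p and X = v/(1-p) w.p. 1-p."""
    return p * math.exp(-lam * v / p) + (1.0 - p) * math.exp(lam * v / (1.0 - p))


def mgf_bound(lam: float, v: float, delta: float) -> float:
    """Claimed bound exp(2 v^2 delta phi(lam / delta))."""
    return math.exp(2.0 * v * v * delta * phi(lam / delta))


def tilted_factor(v: float, p: float, delta: float, c: float) -> float:
    """exp(-lam c v^2) * E exp(lam X) at lam = delta * log(1 + c/2)."""
    lam = delta * math.log(1.0 + c / 2.0)
    return math.exp(-lam * c * v * v) * mgf(lam, v, p)


if __name__ == "__main__":
    delta, p, v, c = 0.1, 0.1, 1.0, 1.0
    lam = delta * math.log(1.0 + c / 2.0)
    print("mgf:", mgf(lam, v, p), "bound:", mgf_bound(lam, v, delta))
    print("tilted factor:", tilted_factor(v, p, delta, c))
```

In every configuration with $\delta \le p \le 1/2$ and $|v| \le 1$ the tilted factor should not exceed one, which is what allows the product in (32) to be peeled off one round at a time.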

Iterating the argument from $t = n$ down to $t = 1$ in (32), we obtain

$$
\mathbb{E}\sup_{v \in V} \left[ \sum_{t=1}^{n} \eta(p_t(y), y_t)\, v_t(y) - c\, v_t(y)^2 \right] \le \frac{\log|V|}{\delta\log\left(1 + \frac{c}{2}\right)} .
$$

The case when $\mathbf{p}$ is $[0,1]$-valued, but the summation is taken only over $\{t : p_t(y) \in [\delta, 1-\delta]\}$, follows immediately through the same argument. □

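Proof of Lemma 3. Both sides of the inequality in the statement of the Lemma are homogeneous with respect to $v_{\max}$, and so we can assume $v_{\max} = 1$ and rescale the problem. We have

$$
\begin{aligned}
\mathbb{E}\max_{v \in V} \left[ \sum_{t=1}^{n} \eta(p_t(y), y_t)\, v_t(y) \right]
&\le \inf_{\lambda > 0} \frac{1}{\lambda} \log\left( \sum_{v \in V} \mathbb{E}\exp\left( \lambda \sum_{t=1}^{n} \eta(p_t(y), y_t)\, v_t(y) \right) \right) \\
&\le \inf_{\lambda > 0} \left\{ \frac{\log|V|}{\lambda} + \max_{v \in V} \frac{1}{\lambda} \log \mathbb{E}\exp\left( \lambda \sum_{t=1}^{n} \eta(p_t(y), y_t)\, v_t(y) \right) \right\} .
\end{aligned} \tag{33}
$$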


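As shown in the proof of Lemma 2, if $X$ is a zero-mean random variable taking on the value $-v/p$ with probability $p$ and $v/(1-p)$ with probability $(1-p)$, where $\delta \le p \le 1/2$ and $|v| \le 1$, then

$$
\log \mathbb{E}e^{\lambda X} \le 2v^2\delta\, \phi(\lambda/\delta)
$$

where $\phi(x) = e^x - x - 1$. Hence,

$$
\begin{aligned}
\mathbb{E}\left[ \exp\left\{ \lambda \sum_{t=1}^{n} \eta(p_t(y), y_t)\, v_t(y) \right\} \,\middle|\, y_1, \ldots, y_{n-1} \right]
&\le \exp\left\{ \lambda \sum_{t=1}^{n-1} \eta(p_t(y), y_t)\, v_t(y) \right\} \times \mathbb{E}\big[ \exp( \lambda\, \eta(p_n(y), y_n)\, v_n(y) ) \,\big|\, y_1, \ldots, y_{n-1} \big] \\
&\le \exp\left\{ \lambda \sum_{t=1}^{n-1} \eta(p_t(y), y_t)\, v_t(y) \right\} \times \exp\left\{ 2\delta\phi(\lambda/\delta) \max_{y_{n-1}} v_n(y)^2 \right\} .
\end{aligned}
$$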


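For $y_{n-1}$, we proceed in a similar fashion:

$$
\begin{aligned}
&\mathbb{E}\left[ \exp\left\{ \lambda \sum_{t=1}^{n-1} \eta(p_t(y), y_t)\, v_t(y) \right\} \times \exp\left\{ 2\delta\phi(\lambda/\delta) \max_{y_{n-1}} v_n(y)^2 \right\} \,\middle|\, y_1, \ldots, y_{n-2} \right] \\
&\quad \le \exp\left\{ \lambda \sum_{t=1}^{n-2} \eta(p_t(y), y_t)\, v_t(y) \right\} \times \mathbb{E}\left[ \exp\left\{ \lambda\, \eta(p_{n-1}(y), y_{n-1})\, v_{n-1}(y) + 2\delta\phi(\lambda/\delta) \max_{y_{n-1}} v_n(y)^2 \right\} \,\middle|\, y_1, \ldots, y_{n-2} \right] \\
&\quad \le \exp\left\{ \lambda \sum_{t=1}^{n-2} \eta(p_t(y), y_t)\, v_t(y) \right\} \times \exp\left\{ 2\delta\phi(\lambda/\delta)\, v_{n-1}(y)^2 + 2\delta\phi(\lambda/\delta) \max_{y_{n-1}} v_n(y)^2 \right\} \\
&\quad \le \exp\left\{ \lambda \sum_{t=1}^{n-2} \eta(p_t(y), y_t)\, v_t(y) \right\} \times \exp\left\{ 2\delta\phi(\lambda/\delta) \max_{y_{n-2},\, y_{n-1}} \big\{ v_{n-1}(y)^2 + v_n(y)^2 \big\} \right\} .
\end{aligned}
$$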


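Unrolling the expression to $t = 1$ we obtain

$$
\log \mathbb{E}\exp\left\{ \lambda \sum_{t=1}^{n} \eta(p_t(y), y_t)\, v_t(y) \right\} \le 2\delta\phi(\lambda/\delta)\, n\bar{v}^2
$$

where $\bar{v}^2 = \max_{v \in V} \max_{y} \frac{1}{n}\sum_{t=1}^{n} v_t(y)^2$. In view of (33), we get

$$
\mathbb{E}\max_{v \in V} \left[ \sum_{t=1}^{n} \eta(p_t(y), y_t)\, v_t(y) \right] \le \inf_{\lambda > 0} \left\{ \frac{\log|V|}{\lambda} + \frac{2\delta\phi(\lambda/\delta)\, n\bar{v}^2}{\lambda} \right\} . \tag{34}
$$

First, consider the case $\delta \ge \frac{\log|V|}{4n\bar{v}^2}$. Then the choice $\lambda = \frac{1}{2}\sqrt{\frac{\delta\log|V|}{n\bar{v}^2}}$ ensures $\lambda \le \delta$. In this case, $\phi(\lambda/\delta)$ can be upper bounded by a quadratic, $\phi(\lambda/\delta) \le (\lambda/\delta)^2 \cdot e$. The upper bound in (34) becomes

$$
\frac{\log|V|}{\lambda} + \frac{2e(\lambda/\delta)^2\, \delta n\bar{v}^2}{\lambda} \le (2+e)\sqrt{\frac{n\bar{v}^2\log|V|}{\delta}} .
$$

On the other hand, if $\delta < \frac{\log|V|}{4n\bar{v}^2}$, the upper bound in (34) is at most

$$
\inf_{\lambda > 0} \left\{ \frac{\log|V|}{\lambda} + \frac{2\log|V|\, \phi(\lambda/\delta)}{4\lambda} \right\} .
$$

Choosing $\lambda = \delta$ yields an upper bound of $\frac{2\log|V|}{\delta}$. Combining the two cases, we arrive at the statement of the Lemma. □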
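The two-case choice of $\lambda$ at the end of the proof of Lemma 3 can be verified numerically by evaluating the right-hand side of (34) at the chosen $\lambda$ and comparing it with the claimed rate. The sketch below assumes $v_{\max} = 1$ as in the proof; the function names and the parameter grid are hypothetical.

```python
import math


def phi(x: float) -> float:
    """phi(x) = e^x - x - 1."""
    return math.exp(x) - x - 1.0


def bound_34(lam: float, log_card: float, n: int, delta: float, vbar: float) -> float:
    """Right-hand side of (34) evaluated at a fixed lambda."""
    return log_card / lam + 2.0 * delta * phi(lam / delta) * n * vbar ** 2 / lam


def lemma3_rate(log_card: float, n: int, delta: float, vbar: float):
    """Return ((34) at the lambda chosen in the applicable case, claimed cap)."""
    threshold = log_card / (4.0 * n * vbar ** 2)
    if delta >= threshold:
        lam = 0.5 * math.sqrt(delta * log_card / (n * vbar ** 2))
        cap = (2.0 + math.e) * math.sqrt(n * vbar ** 2 * log_card / delta)
    else:
        lam = delta
        cap = 2.0 * log_card / delta
    return bound_34(lam, log_card, n, delta, vbar), cap


if __name__ == "__main__":
    value, cap = lemma3_rate(math.log(100), n=1000, delta=0.05, vbar=0.5)
    print("value of (34):", value, "claimed cap:", cap)
```

In both regimes the evaluated bound stays below the corresponding cap, which matches the sum of a $\sqrt{n\log|V|/\delta}$ term and a $\log|V|/\delta$ term used later in the chaining argument.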


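Proof of Theorem 4. Let us use the shorthand $\mathcal{D} = [\delta, 1-\delta]$. Let $V'$ be a sequential $\gamma$-cover of $\mathcal{G}$ on $\mathbf{z}$ in the $\ell_\infty$ sense, i.e.

$$
\forall y \in \{0,1\}^n,\ \forall g \in \mathcal{G},\ \exists v \in V' \ \text{ s.t. } \ |g(z_t(y)) - v_t(y)| \le \gamma \ \text{ for all } t .
$$

Of course, an $\ell_\infty$ cover is also an $\ell_2$ cover at the same scale. Let us augment $V'$ to include the all-zero tree, and denote the resulting set by $V = V' \cup \{0\}$. Denote by $v[y, g]$ a $\gamma$-close tree promised above. We have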


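$$
\begin{align}
&\mathbb{E}\sup_{g \in \mathcal{G}} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\, g(z_t(y)) - K g(z_t(y))^2 \right] \tag{35} \\
&\quad = \mathbb{E}\sup_{g \in \mathcal{G}} \Bigg[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\big( g(z_t(y)) - v[y,g]_t(y) \big) - K\Big( g(z_t(y))^2 - \tfrac{1}{4} v[y,g]_t(y)^2 \Big) \tag{36} \\
&\qquad\qquad + \sum_{t:\, p_t(y) \in \mathcal{D}} \Big( \eta(p_t(y), y_t)\, v[y,g]_t(y) - \tfrac{K}{4} v[y,g]_t(y)^2 \Big) \Bigg] \tag{37} \\
&\quad \le \mathbb{E}\sup_{g \in \mathcal{G}} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\big( g(z_t(y)) - v[y,g]_t(y) \big) - K\Big( g(z_t(y))^2 - \tfrac{1}{4} v[y,g]_t(y)^2 \Big) \right] \tag{38} \\
&\qquad\qquad + \mathbb{E}\max_{v \in V} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\, v_t(y) - \tfrac{K}{4} v_t(y)^2 \right] . \tag{39}
\end{align}
$$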


We now claim that for any $y$ and $g$ there exists an element $v[y, g] \in V$ such that

$$
\sum_{t=1}^{n} g(z_t(y))^2 \ge \frac{1}{4}\sum_{t=1}^{n} v[y,g]_t(y)^2 \tag{40}
$$

and so we can drop the corresponding negative term in the supremum over $\mathcal{G}$. First consider the easy case $\frac{1}{n}\sum_{t=1}^{n} g(z_t(y))^2 \le \gamma^2$. Then we may choose $0 \in V$ as a tree that provides a sequential $\gamma$-cover in the $\ell_2$ sense. Clearly, (40) is then satisfied with this choice of $v[y, g] = 0$. Now, assume $\frac{1}{n}\sum_{t=1}^{n} g(z_t(y))^2 > \gamma^2$. Fix any tree $v[y, g] \in V$ that is $\gamma$-close in the $\ell_2$ sense to $g$ on the path $y$. Denote $u = (v[y,g]_1(y), \ldots, v[y,g]_n(y))$ and $h = (g(z_1(y)), \ldots, g(z_n(y)))$. Thus, we have that $\|u - h\| \le \gamma$ and $\|h\| > \gamma$ for the norm $\|h\|^2 = \frac{1}{n}\sum_{t=1}^{n} h_t^2$. Then

$$
\|u\| \le \|u - h\| + \|h\| \le \gamma + \|h\| \le 2\|h\|
$$

and thus $\|h\| \ge \frac{1}{2}\|u\|$ as desired. We conclude that


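$$
\begin{aligned}
\mathbb{E}\sup_{g \in \mathcal{G}} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\, g(z_t(y)) - K g(z_t(y))^2 \right]
&\le \mathbb{E}\sup_{g \in \mathcal{G}} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\big( g(z_t(y)) - v[y,g]_t(y) \big) \right] \\
&\quad + \mathbb{E}\max_{v \in V} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\, v_t(y) - \frac{K}{4} v_t(y)^2 \right] .
\end{aligned} \tag{41}
$$

By Lemma 2 (applied with $c = K/4$), the second term is upper bounded by

$$
\frac{\log \mathcal{N}_\infty(\mathcal{G}, \gamma, \mathbf{z})}{\delta\log\left(1 + \frac{K}{8}\right)} .
$$

As for the first term, we note that, conditionally on $y_1, \ldots, y_{t-1}$, the random variable $\eta(p_t(y), y_t)$ is zero-mean.

Let us proceed with the chaining technique. To this end, let $v[y, g]^j \in V^j$ be an element of a $\gamma 2^{-j}$-cover of $g \in \mathcal{G}$, with $v[y, g]^0 = v[y, g]$.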


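$$
\begin{align}
&\mathbb{E}\sup_{g \in \mathcal{G}} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\big( g(z_t(y)) - v[y,g]_t(y) \big) \right] \tag{42} \\
&\quad \le \sum_{j=1}^{N} \mathbb{E}\sup_{g \in \mathcal{G}} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\big( v[y,g]^{j}_t(y) - v[y,g]^{j-1}_t(y) \big) \right] \tag{43} \\
&\qquad + \mathbb{E}\sup_{g \in \mathcal{G}} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\big( g(z_t(y)) - v[y,g]^{N}_t(y) \big) \right] . \tag{44}
\end{align}
$$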


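For the last term we use the Cauchy-Schwarz inequality: for any $y$ and $g \in \mathcal{G}$,

$$
\begin{align}
\sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\big( g(z_t(y)) - v[y,g]^{N}_t(y) \big)
&\le \left( \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)^2 \right)^{1/2} \left( \sum_{t:\, p_t(y) \in \mathcal{D}} \big( g(z_t(y)) - v[y,g]^{N}_t(y) \big)^2 \right)^{1/2} \tag{45} \\
&\le \frac{1}{\delta}\, n \gamma 2^{-N} . \tag{46}
\end{align}
$$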


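Further, for any $j = 1, \ldots, N$,

$$
\mathbb{E}\sup_{g \in \mathcal{G}} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\big( v[y,g]^{j}_t(y) - v[y,g]^{j-1}_t(y) \big) \right] \le \mathbb{E}\max_{w \in W^j} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\, w_t(y) \right]
$$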

where $W^j$ is the set of difference trees, defined as follows. For each pair $v' \in V^j$, $v'' \in V^{j-1}$, let $w$ be defined for each path $(y_1, \ldots, y_n) \in \{0,1\}^n$ and $t \in \{1, \ldots, n\}$ as

$$
w_t(y) = \begin{cases} v'_t(y) - v''_t(y) & \text{if there exist } (y'_t, \ldots, y'_n) \text{ and } g \in \mathcal{G} \text{ s.t. } v' = v[\tilde{y}, g]^{j},\ v'' = v[\tilde{y}, g]^{j-1} \text{ for } \tilde{y} = (y_1, \ldots, y_{t-1}, y'_t, \ldots, y'_n), \\ 0 & \text{otherwise.} \end{cases}
$$

In other words, $w$ is defined for each element of the tree as the difference between two trees if there is a continuation of the path on which the two trees are indeed covering elements for some $g \in \mathcal{G}$, and 0 if no such continuation exists. Then $W^j$ is defined as the collection of all such trees $w$ obtained by pairing up all choices of trees from $V^j$ and $V^{j-1}$. Clearly, the size $|W^j| \le |V^j| \times |V^{j-1}| \le |V^j|^2$.

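We now use the result of Lemma 3:

$$
\mathbb{E}\max_{w \in W^j} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\, w_t(y) \right] \le 5\bar{v}\sqrt{\frac{n\log|W^j|}{\delta}} + \frac{2 v_{\max}\log|W^j|}{\delta} \tag{47}
$$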


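with $\bar{v} = \max_{w \in W^j} \max_{y} \big( \frac{1}{n}\sum_{t=1}^{n} w_t(y)^2 \big)^{1/2}$ and $v_{\max} = \max_{w \in W^j} \max_{y, t} |w_t(y)|$. We over-bound $\bar{v}$ by $v_{\max}$ in the arguments below. By construction of each $w \in W^j$, the $\ell_2$ norm $\big(\sum_{t=1}^{n} w_t(y)^2\big)^{1/2}$ along any path is upper bounded by $3\sqrt{n}\,\gamma 2^{-j}$ (see [14]). We conclude that

$$
\mathbb{E}\max_{w \in W^j} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\, w_t(y) \right] \le 15\,\gamma 2^{-j}\sqrt{\frac{2n}{\delta}}\sqrt{\log|V^j|} + \frac{4(\gamma 2^{-j})\log|V^j|}{\delta} .
$$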


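Observe that

$$
\begin{align}
\sum_{j=1}^{N} \gamma 2^{-j}\sqrt{\log|V^j|} &= 2\sum_{j=1}^{N} \big( \gamma 2^{-j} - \gamma 2^{-(j+1)} \big)\sqrt{\log \mathcal{N}_\infty(\mathcal{G}, \gamma 2^{-j}, \mathbf{z})} \tag{48} \\
&\le 2\int_{\gamma 2^{-(N+1)}}^{\gamma} \sqrt{\log \mathcal{N}_\infty(\mathcal{G}, \rho, \mathbf{z})}\, d\rho \tag{49}
\end{align}
$$

and similarly

$$
\sum_{j=1}^{N} \gamma 2^{-j}\log|V^j| \le 2\int_{\gamma 2^{-(N+1)}}^{\gamma} \log \mathcal{N}_\infty(\mathcal{G}, \rho, \mathbf{z})\, d\rho . \tag{50}
$$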
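The passage from the dyadic sums to the entropy integrals in (48)-(50) relies only on the covering numbers being non-increasing in the scale. A quick numerical illustration follows, under a purely hypothetical entropy profile $\log \mathcal{N}_\infty(\mathcal{G}, \rho, \mathbf{z}) = \rho^{-q}$ with $q = 1/2$ (chosen so that the integrals have closed forms); none of these names or values come from the paper.

```python
import math

Q = 0.5  # hypothetical entropy exponent: log N(rho) = rho ** (-Q)


def entropy(rho: float) -> float:
    """Hypothetical metric entropy at scale rho (non-increasing in rho)."""
    return rho ** (-Q)


def sqrt_entropy(rho: float) -> float:
    """Square root of the hypothetical metric entropy."""
    return math.sqrt(entropy(rho))


def dyadic_sum(h, gamma: float, N: int) -> float:
    """Sum over j = 1..N of gamma 2^{-j} * h(gamma 2^{-j})."""
    return sum(gamma * 2.0 ** (-j) * h(gamma * 2.0 ** (-j)) for j in range(1, N + 1))


def power_integral(s: float, a: float, b: float) -> float:
    """Closed form of the integral of rho ** (-s) over [a, b], for s != 1."""
    return (b ** (1.0 - s) - a ** (1.0 - s)) / (1.0 - s)


if __name__ == "__main__":
    gamma, N = 1.0, 3
    lower = gamma * 2.0 ** (-(N + 1))
    print(dyadic_sum(sqrt_entropy, gamma, N), 2.0 * power_integral(Q / 2, lower, gamma))
    print(dyadic_sum(entropy, gamma, N), 2.0 * power_integral(Q, lower, gamma))
```

In each printed pair the dyadic sum (left) is dominated by twice the integral (right), as in (49) and (50).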


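Fix $a \in (0, \gamma)$ and let $N = \max\{j : \gamma 2^{-j} > 2a\}$. Then $\gamma 2^{-(N+1)} \le 2a$ and $\gamma 2^{-N} \le 4a$. Combining all the bounds,

$$
\begin{align}
&\mathbb{E}\sup_{g \in \mathcal{G}} \left[ \sum_{t:\, p_t(y) \in \mathcal{D}} \eta(p_t(y), y_t)\big( g(z_t(y)) - v[y,g]_t(y) \big) \right] \tag{51} \\
&\quad \le \inf_{a \in (0, \gamma]} \left\{ \frac{4na}{\delta} + 30\sqrt{\frac{2n}{\delta}} \int_{a}^{\gamma} \sqrt{\log \mathcal{N}_\infty(\mathcal{G}, \rho, \mathbf{z})}\, d\rho + \frac{8}{\delta} \int_{a}^{\gamma} \log \mathcal{N}_\infty(\mathcal{G}, \rho, \mathbf{z})\, d\rho \right\} . \tag{52}
\end{align}
$$

The statement of the theorem follows by combining the two upper bounds for (41). □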


Proof of Theorem 6. The proof closely follows the one in [14, Thm. 4], and we refer to that paper for the missing details. Define the function $g_k(d, n) = \sum_{i=0}^{d} \binom{n}{i} k^i$ for $n \ge 1$ and $d \ge 0$, and note the recursion

$$
g_k(d, n-1) + k\, g_k(d-1, n-1) = g_k(d, n).
$$
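The recursion is Pascal's rule applied term by term, and it can be confirmed by direct computation. The sketch below assumes the form $g_k(d, n) = \sum_{i=0}^{d} \binom{n}{i} k^i$, which is the form consistent with the recursion; the function name `g` is hypothetical.

```python
from math import comb


def g(k: int, d: int, n: int) -> int:
    """g_k(d, n) = sum_{i=0}^{d} C(n, i) * k^i (assumed form)."""
    return sum(comb(n, i) * k ** i for i in range(d + 1))


if __name__ == "__main__":
    k, d, n = 3, 2, 5
    # Left-hand side and right-hand side of the recursion.
    print(g(k, d, n - 1) + k * g(k, d - 1, n - 1), g(k, d, n))
```

For $k = 1$ and $d \ge n$ the function reduces to $2^n$, the familiar count from the binary Sauer-Shelah lemma.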


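We proceed by induction on $(n, d)$. The base of the induction is the same as in the proof of [14, Thm. 4]. For the induction step, fix an ordered $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$ and suppose $\mathrm{fat}_2(\mathcal{F}, \mathrm{Img}(\mathbf{x})) = d$. Define the partition $\mathcal{F} = \cup_i \mathcal{F}_i$ according to $\mathcal{F}_i = \{f : f(\mathbf{x}_1) = i\}$. For the sake of contradiction, suppose $\mathrm{fat}_2(\mathcal{F}_i, \mathrm{Img}(\mathbf{x})) = \mathrm{fat}_2(\mathcal{F}_j, \mathrm{Img}(\mathbf{x})) = d$ for some $j - i \ge 2$. Then there exist two $\mathrm{Img}(\mathbf{x})$-valued ordered trees $\mathbf{w}$ and $\mathbf{v}$ of depth $d$ that are 2-shattered by $\mathcal{F}_i$ and $\mathcal{F}_j$, respectively. Crucially, $\mathbf{x}_1$ cannot appear in either of these trees (that is, $\mathbf{x}_1 \notin \mathrm{Img}(\mathbf{w}) \cup \mathrm{Img}(\mathbf{v})$) because functions in $\mathcal{F}_i$ (resp., $\mathcal{F}_j$) are constant on $\mathbf{x}_1$. Furthermore, $\mathbf{x}_1 < a$ for any $a \in \mathrm{Img}(\mathbf{w}) \cup \mathrm{Img}(\mathbf{v})$. Hence, by joining $\mathbf{w}$ and $\mathbf{v}$ with $\mathbf{x}_1$ at the root, we obtain an ordered tree which is now 2-shattered. The witness of this shattering is constructed by joining the two witnesses (for $\mathbf{w}$ and $\mathbf{v}$) and $(i + j)/2$ at the root. This leads to a contradiction. The rest of the proof follows exactly as in [14, Thm. 4]. □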

Proof of Lemma 7. The gradient of $g_t$ at $w_t$ is

$$
\nabla g_t(w_t) = -\mathbf{1}\{y_t = 1\}\frac{x_t}{1 + \langle w_t, x_t \rangle} + \mathbf{1}\{y_t = 0\}\frac{x_t}{1 - \langle w_t, x_t \rangle}
$$

and the Hessian of the barrier is

$$
\nabla^2 R(w_t) = \frac{2}{1 - \|w_t\|^2}\, I + \frac{4}{(1 - \|w_t\|^2)^2}\, w_t w_t^{\top} .
$$


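By rotational invariance, for the following calculation we may assume without loss of generality that $w_t = a e_1$ is in the direction of the basis vector $e_1$ and $a \ge 0$. We can then write the inverse (see [2]) as

$$
\nabla^2 R(w_t)^{-1} = \frac{1}{2}\left( \frac{(1 - a^2)^2}{1 + a^2}\, e_1 e_1^{\top} + (1 - a^2)\big( I - e_1 e_1^{\top} \big) \right) .
$$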


Consider the case $y_t = 0$ (the analysis for $y_t = 1$ follows the same lines). Let us write $x_t = b e_1 + y$ with $\langle y, e_1 \rangle = 0$ and $\|y\|^2 \le 1 - b^2$. We have

$$
\nabla g_t(w_t) = \frac{x_t}{1 - \langle w_t, x_t \rangle} = \frac{b e_1 + y}{1 - ab}
$$

and

$$
\nabla g_t(w_t)^{\top}\, \nabla^2 R(w_t)^{-1}\, \nabla g_t(w_t) \le \frac{b^2}{(1 - ab)^2}\cdot(1 - a)^2 + \frac{1 - b^2}{(1 - ab)^2}\cdot(1 - a) .
$$


If $b < 0$, the above expression is upper bounded by 2, and for $b \ge 0$, the expression is upper bounded by 3 (we did not optimize the constants). □
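The bound on the local norm of the gradient can be probed numerically. The sketch below makes an explicit assumption that is not confirmed by this section: the barrier is taken to be $R(w) = -\log(1 - \|w\|^2)$ on the unit ball, whose Hessian has the form $\alpha I + \beta w w^{\top}$ and can be inverted with the Sherman-Morrison formula. All function names, the dimension, and the sampling scheme are hypothetical.

```python
import math
import random


def dot(u, v) -> float:
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def local_norm_sq(w, x, y: int) -> float:
    """grad g_t(w)^T [hess R(w)]^{-1} grad g_t(w), assuming R(w) = -log(1 - ||w||^2)."""
    wx = dot(w, x)
    scale = -1.0 / (1.0 + wx) if y == 1 else 1.0 / (1.0 - wx)
    grad = [scale * xi for xi in x]
    gap = 1.0 - dot(w, w)
    alpha = 2.0 / gap        # coefficient of the identity in the Hessian
    beta = 4.0 / gap ** 2    # coefficient of w w^T in the Hessian
    wg = dot(w, grad)
    # Sherman-Morrison: (alpha I + beta w w^T)^{-1} = (I - beta w w^T / (alpha + beta ||w||^2)) / alpha
    return (dot(grad, grad) - beta * wg ** 2 / (alpha + beta * dot(w, w))) / alpha


def random_point(dim: int, radius: float, rng: random.Random):
    """Random direction in R^dim scaled to the given radius."""
    while True:
        u = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(dot(u, u))
        if norm > 1e-9:
            return [radius * ui / norm for ui in u]


if __name__ == "__main__":
    rng = random.Random(0)
    worst = 0.0
    for _ in range(2000):
        w = random_point(3, rng.uniform(0.0, 0.999), rng)
        x = random_point(3, rng.uniform(0.0, 1.0), rng)
        worst = max(worst, local_norm_sq(w, x, 0), local_norm_sq(w, x, 1))
    print("largest local norm squared observed:", worst)
```

Under this assumed barrier the observed values stay below the constant 3 from the lemma even when $w_t$ approaches the boundary of the ball, which is the regime where the Euclidean norm of the gradient blows up.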
Proof of Lemma 9. We reproduce the proof from [8] for completeness. Let $x \in \operatorname{int}\mathcal{K}$ and $r \in [0,1)$. Let

$$
y = x - \frac{r}{\|\nabla\psi(x)\|_x^*}\, [\nabla^2 F(x)]^{-1} \nabla\psi(x).
$$

Then $y \in \operatorname{int}\mathcal{K}$ because the Dikin ellipsoid is contained in the set. Hence,

$$
0 \le \psi(y) \le \psi(x) + \langle \nabla\psi(x), y - x \rangle = \psi(x) - r\|\nabla\psi(x)\|_x^* .
$$

The statement follows because $r$ is arbitrary in $[0,1)$. □


Acknowledgements 

We gratefully acknowledge the support of NSF under grants CAREER DMS-0954737 and CCF-1116928, as well as the Dean's Research Fund.


References 

[1] J. Abernethy, A. Agarwal, P. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

[2] J. Abernethy and A. Rakhlin. Beating the adaptive bandit with high probability. In COLT, 2009.

[3] N. Cesa-Bianchi and G. Lugosi. Minimax regret under log loss for general classes of experts. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 12-18. ACM, 1999.

[4] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[5] Y. Freund. Predicting a binary sequence almost as well as the optimal biased coin. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 89-98. ACM, 1996.

[6] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

[7] N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory, 44:2124-2147, 1998.

[8] Y. Nesterov. Barrier subgradient method. Mathematical Programming, 127(1):31-56, 2011.

[9] A. Rakhlin, O. Shamir, and K. Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems 25, pages 2150-2158, 2012.

[10] A. Rakhlin and K. Sridharan. Statistical learning and sequential prediction, 2012. Available at http://stat.wharton.upenn.edu/~rakhlin/courses/stat928/stat928_notes.pdf.

[11] A. Rakhlin and K. Sridharan. Online nonparametric regression. In Conference on Learning Theory, 2014.

[12] A. Rakhlin and K. Sridharan. Online nonparametric regression with general loss functions, 2015. Available at http://arxiv.org/abs/1501.06598.

[13] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. Advances in Neural Information Processing Systems 23, pages 1984-1992, 2010.

[14] A. Rakhlin, K. Sridharan, and A. Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, February 2014.

[15] J. Rissanen. Complexity of strings in the class of Markov sources. Information Theory, IEEE Transactions on, 32(4):526-532, 1986.

[16] J. Rissanen. Fisher information and stochastic complexity. Information Theory, IEEE Transactions on, 42(1):40-47, 1996.

[17] Y. M. Shtarkov. Universal sequential coding of single messages. Problemy Peredachi Informatsii, 23(3):3-17, 1987.

[18] Q. Xie and A.R. Barron. Asymptotic minimax regret for data compression, gambling, and prediction. Information Theory, IEEE Transactions on, 46(2):431-445, 2000.


